David Durán Prieto, Gerardo Adrián Aguirre Vivar, Ana Jiménez Santamaría
The chosen data set comes from a survey conducted by the PSA (Philippine Statistics Authority) that records household income and expenditure in the Philippines. It contains 41,544 observations and 60 variables, which have been grouped into the following categories:
For several years, identifying an optimal socio-economic classification model for the Philippines has been a difficult problem. To date, no model has been universally accepted, and the various government agencies each use their own. This work therefore sets itself one objective: to design a model that addresses the problem and solves it effectively.
Objective: Predict the income of a Filipino household from the available data.
Question: Using a multiple linear regression model, which variables are best suited to predicting income?
Target: The response variable is the total income of each Filipino household (Total.Household.Income).
The analysis will be divided into two phases:
The first phase will consist of an exploratory analysis of the data, to better understand the meaning and relevance of each variable. Key points will be studied, such as the degree of correlation between the variable of interest and the others. To that end, the following will be considered for each variable studied:
The second phase will consist of building a multiple linear regression model with the selected predictor variables.
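The two phases can be sketched on simulated data as follows. This is only an illustration under stated assumptions: the data frame `toy` and its columns (`income`, `food`, `members`, `noise`) are hypothetical stand-ins rather than survey variables, and ranking predictors by absolute correlation is just one possible screening criterion, not necessarily the one applied in this report.

```r
# Simulated stand-in for the survey data (hypothetical columns and values).
set.seed(1)
n <- 200
toy <- data.frame(
  food    = runif(n, 20, 200),
  members = sample(1:8, n, replace = TRUE),
  noise   = rnorm(n)
)
toy$income <- 50 + 2 * toy$food + 10 * toy$members + rnorm(n, sd = 5)

# Phase 1: correlation of each candidate predictor with the response.
predictors <- setdiff(names(toy), "income")
cors <- sapply(toy[predictors], function(x) cor(x, toy$income))
print(sort(abs(cors), decreasing = TRUE))

# Phase 2: multiple linear regression on the retained predictors.
fit <- lm(income ~ food + members, data = toy)
print(coef(fit))
```

In this simulation the estimated coefficients should land close to the values used to generate `income`, and `noise` should show a correlation near zero, which is why it is left out of the model formula.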
Before visualizing the variables graphically (to get a view of how the data are distributed), the data set will be preprocessed and cleaned. Values that should be treated as missing will be labelled NA; certain variables will be dropped because they are of no interest for the stated objective; and finally, the training and test/validation sets will be prepared. The training set will be used to fit the prediction model, which will then be evaluated on the test/validation set.
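The three preprocessing steps can be sketched in base R on a toy data frame. Everything here is an assumption made for illustration: the columns (`income`, `walls`, `id`), the choice of "Not applicable" as the value recoded to NA, and the 80/20 split ratio are hypothetical and are not taken from the report.

```r
# Toy data frame standing in for the survey (hypothetical columns and values).
set.seed(42)
toy <- data.frame(
  income = round(runif(100, 1e4, 5e5)),
  walls  = sample(c("Strong", "Light", "Not applicable"), 100, replace = TRUE),
  id     = seq_len(100),
  stringsAsFactors = FALSE
)

# 1. Label as NA the values that should be treated as missing.
toy$walls[toy$walls == "Not applicable"] <- NA

# 2. Drop variables of no interest for the objective.
toy$id <- NULL

# 3. Split into training (80%) and test/validation (20%) sets.
#    The 80/20 ratio is an assumption, not a value from the report.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
```

A plain random split is used here to keep the sketch dependency-free; `createDataPartition()` from caret, which is loaded below, is a common alternative that samples within strata of the outcome.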
# ----- Load the required libraries -----
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(GGally)
library(gridExtra)
library(egg)
library(VIM)
library(vcd)
library(Hmisc)
library(readr)
library(moments)
library(caret)
library(gmodels)
library(reshape)
library(ggcorrplot)
Next, the main statistics of the numeric variables are summarized to inspect their mean, dispersion, total number of observations, and missing values per variable. Interestingly, missing data appear only in categorical variables (Household.Head.Occupation and Household.Head.Class.of.Worker, with 7,536 NA's each), which will be handled later.
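As a minimal sketch of how missing values per column can be counted and cross-checked, consider a toy data frame with hypothetical columns (`income`, `job`, `occupation`). In the summary output of this report, the 7,536 NA's in the two occupation-related variables equal the number of household heads with "No Job/Business", which suggests (but does not by itself prove) that the values are structurally missing; the last line below shows how such a coincidence could be verified row by row.

```r
# Toy data frame (hypothetical): occupation is missing when there is no job.
toy <- data.frame(
  income     = c(120000, 95000, 300000, 80000),
  job        = c("With Job/Business", "No Job/Business",
                 "With Job/Business", "No Job/Business"),
  occupation = c("Rice farmers", NA, "Accountants and auditors", NA)
)

# Number of missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Do the missing occupations coincide exactly with heads without a job/business?
coincide <- all(is.na(toy$occupation) == (toy$job == "No Job/Business"))
print(coincide)
```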
# ----- Load the data -----
datos <- read.csv('Family_Income_and_Expenditure.csv', stringsAsFactors = TRUE)
datos_occupation <- datos
# ----- Numerical summary of the variables -----
summary(datos)
## Total.Household.Income Region Total.Food.Expenditure
## Min. : 11285 IVA - CALABARZON : 4162 Min. : 2947
## 1st Qu.: 104895 NCR : 4130 1st Qu.: 51017
## Median : 164080 III - Central Luzon : 3237 Median : 72986
## Mean : 247556 VI - Western Visayas : 2851 Mean : 85099
## 3rd Qu.: 291138 VII - Central Visayas: 2541 3rd Qu.:105636
## Max. :11815988 V - Bicol Region : 2472 Max. :827565
## (Other) :22151
## Main.Source.of.Income Agricultural.Household.indicator
## Enterpreneurial Activities:10320 Min. :0.0000
## Other sources of Income :10836 1st Qu.:0.0000
## Wage/Salaries :20388 Median :0.0000
## Mean :0.4299
## 3rd Qu.:1.0000
## Max. :2.0000
##
## Bread.and.Cereals.Expenditure Total.Rice.Expenditure Meat.Expenditure
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 16556 1st Qu.: 11020 1st Qu.: 3354
## Median : 23324 Median : 16620 Median : 7332
## Mean : 25134 Mean : 18196 Mean : 10540
## 3rd Qu.: 31439 3rd Qu.: 23920 3rd Qu.: 14292
## Max. :765864 Max. :758326 Max. :261566
##
## Total.Fish.and..marine.products.Expenditure Fruit.Expenditure
## Min. : 0 Min. : 0
## 1st Qu.: 5504 1st Qu.: 1025
## Median : 8695 Median : 1820
## Mean : 10529 Mean : 2550
## 3rd Qu.: 13388 3rd Qu.: 3100
## Max. :188208 Max. :273769
##
## Vegetables.Expenditure Restaurant.and.hotels.Expenditure
## Min. : 0 Min. : 0
## 1st Qu.: 2873 1st Qu.: 1930
## Median : 4314 Median : 7314
## Mean : 5007 Mean : 15437
## 3rd Qu.: 6304 3rd Qu.: 19921
## Max. :74800 Max. :725296
##
## Alcoholic.Beverages.Expenditure Tobacco.Expenditure
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 270 Median : 300
## Mean : 1085 Mean : 2295
## 3rd Qu.: 1299 3rd Qu.: 3146
## Max. :59592 Max. :139370
##
## Clothing..Footwear.and.Other.Wear.Expenditure Housing.and.water.Expenditure
## Min. : 0 Min. : 1950
## 1st Qu.: 1365 1st Qu.: 13080
## Median : 2740 Median : 22992
## Mean : 4955 Mean : 38376
## 3rd Qu.: 5580 3rd Qu.: 45948
## Max. :356750 Max. :2188560
##
## Imputed.House.Rental.Value Medical.Care.Expenditure Transportation.Expenditure
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 6000 1st Qu.: 300 1st Qu.: 2412
## Median : 10800 Median : 1125 Median : 6036
## Mean : 20922 Mean : 7160 Mean : 11806
## 3rd Qu.: 24000 3rd Qu.: 4680 3rd Qu.: 13776
## Max. :1920000 Max. :1049275 Max. :834996
##
## Communication.Expenditure Education.Expenditure
## Min. : 0 Min. : 0
## 1st Qu.: 564 1st Qu.: 0
## Median : 1506 Median : 880
## Mean : 4095 Mean : 7474
## 3rd Qu.: 3900 3rd Qu.: 4060
## Max. :149940 Max. :731000
##
## Miscellaneous.Goods.and.Services.Expenditure Special.Occasions.Expenditure
## Min. : 0 Min. : 0
## 1st Qu.: 3792 1st Qu.: 0
## Median : 6804 Median : 1500
## Mean : 12522 Mean : 5266
## 3rd Qu.: 14154 3rd Qu.: 5000
## Max. :553560 Max. :556700
##
## Crop.Farming.and.Gardening.expenses
## Min. : 0
## 1st Qu.: 0
## Median : 0
## Mean : 13817
## 3rd Qu.: 6313
## Max. :3729973
##
## Total.Income.from.Entrepreneurial.Acitivites Household.Head.Sex
## Min. : 0 Female: 9061
## 1st Qu.: 0 Male :32483
## Median : 19222
## Mean : 54376
## 3rd Qu.: 65969
## Max. :9234485
##
## Household.Head.Age Household.Head.Marital.Status
## Min. : 9.00 Annulled : 11
## 1st Qu.:41.00 Divorced/Separated: 1425
## Median :51.00 Married :31347
## Mean :51.38 Single : 1942
## 3rd Qu.:61.00 Unknown : 1
## Max. :99.00 Widowed : 6818
##
## Household.Head.Highest.Grade.Completed
## High School Graduate : 9628
## Elementary Graduate : 7640
## Grade 4 : 2282
## Grade 5 : 2123
## Second Year High School: 2104
## Grade 3 : 1994
## (Other) :15773
## Household.Head.Job.or.Business.Indicator
## No Job/Business : 7536
## With Job/Business:34008
##
##
##
##
##
## Household.Head.Occupation
## Farmhands and laborers : 3478
## Rice farmers : 2849
## General managers/managing proprietors in wholesale and retail trade : 2028
## General managers/managing proprietors in transportation, storage and communications: 1932
## Corn farmers : 1724
## (Other) :21997
## NA's : 7536
## Household.Head.Class.of.Worker
## Self-employed wihout any employee :13766
## Worked for private establishment :13731
## Worked for government/government corporation : 2820
## Employer in own family-operated farm or business: 2581
## Worked for private household : 811
## (Other) : 299
## NA's : 7536
## Type.of.Household Total.Number.of.Family.members
## Extended Family :12932 Min. : 1.000
## Single Family :28445 1st Qu.: 3.000
## Two or More Nonrelated Persons/Members: 167 Median : 4.000
## Mean : 4.635
## 3rd Qu.: 6.000
## Max. :26.000
##
## Members.with.age.less.than.5.year.old Members.with.age.5...17.years.old
## Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :1.000
## Mean :0.4102 Mean :1.363
## 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :5.0000 Max. :8.000
##
## Total.number.of.family.members.employed
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :1.273
## 3rd Qu.:2.000
## Max. :8.000
##
## Type.of.Building.House
## Commercial/industrial/agricultural building: 51
## Duplex : 1084
## Institutional living quarter : 9
## Multi-unit residential : 1329
## Other building unit (e.g. cave, boat) : 2
## Single house :39069
##
## Type.of.Roof
## Light material (cogon,nipa,anahaw) : 5074
## Mixed but predominantly light materials : 846
## Mixed but predominantly salvaged materials : 56
## Mixed but predominantly strong materials : 2002
## Not Applicable : 12
## Salvaged/makeshift materials : 212
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos):33342
## Type.of.Walls House.Floor.Area House.Age Number.of.bedrooms
## Light : 8267 Min. : 5.0 Min. : 0.00 Min. :0.000
## NOt applicable: 12 1st Qu.: 25.0 1st Qu.: 10.00 1st Qu.:1.000
## Quite Strong : 3487 Median : 40.0 Median : 17.00 Median :2.000
## Salvaged : 456 Mean : 55.6 Mean : 20.13 Mean :1.788
## Strong :27739 3rd Qu.: 70.0 3rd Qu.: 26.00 3rd Qu.:2.000
## Very Light : 1583 Max. :998.0 Max. :200.00 Max. :9.000
##
## Tenure.Status
## Own or owner-like possession of house and lot :29541
## Own house, rent-free lot with consent of owner : 6165
## Rent house/room including lot : 2203
## Rent-free house and lot with consent of owner : 2014
## Own house, rent-free lot without consent of owner: 995
## Own house, rent lot : 425
## (Other) : 201
## Toilet.Facilities
## Water-sealed, sewer septic tank, used exclusively by household:29162
## Water-sealed, sewer septic tank, shared with other household : 3694
## Water-sealed, other depository, used exclusively by household : 2343
## Closed pit : 2273
## None : 1580
## Open pit : 1189
## (Other) : 1303
## Electricity Main.Source.of.Water.Supply
## Min. :0.0000 Own use, faucet, community water system:16093
## 1st Qu.:1.0000 Shared, tubed/piped deep well : 6242
## Median :1.0000 Shared, faucet, community water system : 4614
## Mean :0.8908 Own use, tubed/piped deep well : 4587
## 3rd Qu.:1.0000 Dug well : 3876
## Max. :1.0000 Protected spring, river, stream, etc : 2657
## (Other) : 3475
## Number.of.Television Number.of.CD.VCD.DVD Number.of.Component.Stereo.set
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.8569 Mean :0.4352 Mean :0.1621
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :6.0000 Max. :5.0000 Max. :5.0000
##
## Number.of.Refrigerator.Freezer Number.of.Washing.Machine
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3942 Mean :0.3198
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :5.0000 Max. :3.0000
##
## Number.of.Airconditioner Number.of.Car..Jeep..Van
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.1298 Mean :0.08121
## 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :5.0000 Max. :5.00000
##
## Number.of.Landline.wireless.telephones Number.of.Cellular.phone
## Min. :0.00000 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.: 1.000
## Median :0.00000 Median : 2.000
## Mean :0.06061 Mean : 1.906
## 3rd Qu.:0.00000 3rd Qu.: 3.000
## Max. :4.00000 Max. :10.000
##
## Number.of.Personal.Computer Number.of.Stove.with.Oven.Gas.Range
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.315 Mean :0.135
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :6.000 Max. :3.000
##
## Number.of.Motorized.Banca Number.of.Motorcycle.Tricycle
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.01312 Mean :0.2899
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :3.00000 Max. :5.0000
##
# ----- Missing data in the data set -----
describe(datos)
## datos
##
## 60 Variables 41544 Observations
## --------------------------------------------------------------------------------
## Total.Household.Income
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 38670 1 247556 219756 56072 71596
## .25 .50 .75 .90 .95
## 104895 164080 291138 502021 692298
##
## lowest : 11285 11988 12039 12141 12911
## highest: 6452314 7082152 9952913 11639365 11815988
## --------------------------------------------------------------------------------
## Region
## n missing distinct
## 41544 0 17
##
## lowest : ARMM CAR Caraga I - Ilocos Region II - Cagayan Valley
## highest: VII - Central Visayas VIII - Eastern Visayas X - Northern Mindanao XI - Davao Region XII - SOCCSKSARGEN
## --------------------------------------------------------------------------------
## Total.Food.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 35776 1 85099 52059 27956 35654
## .25 .50 .75 .90 .95
## 51017 72986 105636 148255 181991
##
## lowest : 2947 3704 5408 5482 5638, highest: 691917 720007 729606 791848 827565
## --------------------------------------------------------------------------------
## Main.Source.of.Income
## n missing distinct
## 41544 0 3
##
## Value Enterpreneurial Activities Other sources of Income
## Frequency 10320 10836
## Proportion 0.248 0.261
##
## Value Wage/Salaries
## Frequency 20388
## Proportion 0.491
## --------------------------------------------------------------------------------
## Agricultural.Household.indicator
## n missing distinct Info Mean Gmd
## 41544 0 3 0.679 0.4299 0.6278
##
## Value 0 1 2
## Frequency 28106 9018 4420
## Proportion 0.677 0.217 0.106
## --------------------------------------------------------------------------------
## Bread.and.Cereals.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 26082 1 25134 13311 8552 11487
## .25 .50 .75 .90 .95
## 16556 23324 31439 40385 46887
##
## lowest : 0 25 31 32 42, highest: 270612 338818 345643 437467 765864
## --------------------------------------------------------------------------------
## Total.Rice.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 16145 1 18196 11582 3237 6188
## .25 .50 .75 .90 .95
## 11020 16620 23920 31481 36940
##
## lowest : 0 1 2 8 10, highest: 189906 206702 343907 429640 758326
## --------------------------------------------------------------------------------
## Meat.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 18619 1 10540 10234 890 1510
## .25 .50 .75 .90 .95
## 3354 7332 14292 23697 30951
##
## lowest : 0 16 18 22 25, highest: 114504 119230 132142 140992 261566
## --------------------------------------------------------------------------------
## Total.Fish.and..marine.products.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 18014 1 10529 7604 2438 3461
## .25 .50 .75 .90 .95
## 5504 8695 13388 19431 24490
##
## lowest : 0 10 26 36 40, highest: 98288 113749 119640 125802 188208
## --------------------------------------------------------------------------------
## Fruit.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 7140 1 2550 2308 390 583
## .25 .50 .75 .90 .95
## 1025 1820 3100 5190 7120
##
## lowest : 0 4 5 10 12, highest: 47042 48980 69319 82600 273769
## --------------------------------------------------------------------------------
## Vegetables.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 10599 1 5007 3296 1330 1861
## .25 .50 .75 .90 .95
## 2873 4314 6304 8854 10886
##
## lowest : 0 6 25 30 33, highest: 49000 49810 52401 55230 74800
## --------------------------------------------------------------------------------
## Restaurant.and.hotels.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 12367 0.999 15437 19509 0 120
## .25 .50 .75 .90 .95
## 1930 7314 19921 39629 57064
##
## lowest : 0 1 3 4 10, highest: 519820 523230 597150 625200 725296
## --------------------------------------------------------------------------------
## Alcoholic.Beverages.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 4084 0.934 1085 1610 0 0
## .25 .50 .75 .90 .95
## 0 270 1299 3000 4602
##
## lowest : 0 5 9 10 12, highest: 44400 44704 46950 51688 59592
## --------------------------------------------------------------------------------
## Tobacco.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 3118 0.897 2295 3396 0 0
## .25 .50 .75 .90 .95
## 0 300 3146 7240 10498
##
## lowest : 0 2 3 4 5, highest: 56380 61359 73881 97740 139370
## --------------------------------------------------------------------------------
## Clothing..Footwear.and.Other.Wear.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 9819 1 4955 5561 350 650
## .25 .50 .75 .90 .95
## 1365 2740 5580 11126 16806
##
## lowest : 0 12 20 25 30, highest: 174242 191756 212925 217500 356750
## --------------------------------------------------------------------------------
## Housing.and.water.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 13243 1 38375 38124 7020 8832
## .25 .50 .75 .90 .95
## 13080 22992 45948 80520 114210
##
## lowest : 1950 1980 2100 2112 2118
## highest: 1403310 1458300 1468476 1663812 2188560
## --------------------------------------------------------------------------------
## Imputed.House.Rental.Value
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 266 0.998 20922 24331 1200 3000
## .25 .50 .75 .90 .95
## 6000 10800 24000 48000 66000
##
## lowest : 0 600 720 900 960
## highest: 1020000 1080000 1200000 1500000 1920000
## --------------------------------------------------------------------------------
## Medical.Care.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 11887 1 7160 11638 30 92
## .25 .50 .75 .90 .95
## 300 1125 4680 15287 30005
##
## lowest : 0 5 6 7 8
## highest: 767726 900279 973700 1038512 1049275
## --------------------------------------------------------------------------------
## Transportation.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 7435 1 11806 14087 600 1026
## .25 .50 .75 .90 .95
## 2412 6036 13776 27492 41026
##
## lowest : 0 12 18 24 30, highest: 481098 530322 539004 601890 834996
## --------------------------------------------------------------------------------
## Communication.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 3826 0.999 4095 5584 0 0
## .25 .50 .75 .90 .95
## 564 1506 3900 11280 18720
##
## lowest : 0 12 18 24 30, highest: 101982 110160 111360 112500 149940
## --------------------------------------------------------------------------------
## Education.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 6893 0.974 7474 12380 0 0
## .25 .50 .75 .90 .95
## 0 880 4060 21350 38750
##
## lowest : 0 5 10 12 15, highest: 498178 502600 669400 700000 731000
## --------------------------------------------------------------------------------
## Miscellaneous.Goods.and.Services.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 7669 1 12522 13527 1566 2232
## .25 .50 .75 .90 .95
## 3792 6804 14154 28816 41795
##
## lowest : 0 18 60 78 90, highest: 365484 368628 437424 447318 553560
## --------------------------------------------------------------------------------
## Special.Occasions.Expenditure
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 3412 0.968 5266 7949 0 0
## .25 .50 .75 .90 .95
## 0 1500 5000 12750 21697
##
## lowest : 0 4 8 10 15, highest: 277860 290000 300000 340000 556700
## --------------------------------------------------------------------------------
## Crop.Farming.and.Gardening.expenses
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 9961 0.644 13817 24027 0 0
## .25 .50 .75 .90 .95
## 0 0 6313 45113 78205
##
## lowest : 0 10 20 25 30
## highest: 1331340 1370800 1779690 2823280 3729973
## --------------------------------------------------------------------------------
## Total.Income.from.Entrepreneurial.Acitivites
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 20204 0.957 54376 78766 0 0
## .25 .50 .75 .90 .95
## 0 19222 65969 126924 191197
##
## lowest : 0 16 20 26 45
## highest: 5107451 5749030 5790000 6576302 9234485
## --------------------------------------------------------------------------------
## Household.Head.Sex
## n missing distinct
## 41544 0 2
##
## Value Female Male
## Frequency 9061 32483
## Proportion 0.218 0.782
## --------------------------------------------------------------------------------
## Household.Head.Age
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 89 1 51.38 16.11 29 33
## .25 .50 .75 .90 .95
## 41 51 61 71 76
##
## lowest : 9 10 13 14 15, highest: 95 96 97 98 99
## --------------------------------------------------------------------------------
## Household.Head.Marital.Status
## n missing distinct
## 41544 0 6
##
## lowest : Annulled Divorced/Separated Married Single Unknown
## highest: Divorced/Separated Married Single Unknown Widowed
##
## Value Annulled Divorced/Separated Married
## Frequency 11 1425 31347
## Proportion 0.000 0.034 0.755
##
## Value Single Unknown Widowed
## Frequency 1942 1 6818
## Proportion 0.047 0.000 0.164
## --------------------------------------------------------------------------------
## Household.Head.Highest.Grade.Completed
## n missing distinct
## 41544 0 46
##
## lowest : Agriculture, Forestry, and Fishery Programs Architecture and Building Programs Arts Programs Basic Programs Business and Administration Programs
## highest: Teacher Training and Education Sciences Programs Third Year College Third Year High School Transport Services Programs Veterinary Programs
## --------------------------------------------------------------------------------
## Household.Head.Job.or.Business.Indicator
## n missing distinct
## 41544 0 2
##
## Value No Job/Business With Job/Business
## Frequency 7536 34008
## Proportion 0.181 0.819
## --------------------------------------------------------------------------------
## Household.Head.Occupation
## n missing distinct
## 34008 7536 378
##
## lowest : Accountants and auditors Accounting and bookkeeping clerks Administrative secretaries and related associate professionals Advertising and public relations managers Agricultural or industrial machinery mechanics and fitters
## highest: Wood products machine operators Wood treaters Woodworking machine setters and setter-operators Word processor and related operators Workers reporting occupations unidentifiable or inadequately defined
## --------------------------------------------------------------------------------
## Household.Head.Class.of.Worker
## n missing distinct
## 34008 7536 7
##
## lowest : Employer in own family-operated farm or business Self-employed wihout any employee Worked for government/government corporation Worked for private establishment Worked for private household
## highest: Worked for government/government corporation Worked for private establishment Worked for private household Worked with pay in own family-operated farm or business Worked without pay in own family-operated farm or business
## --------------------------------------------------------------------------------
## Type.of.Household
## n missing distinct
## 41544 0 3
##
## Extended Family (12932, 0.311), Single Family (28445, 0.685), Two or More
## Nonrelated Persons/Members (167, 0.004)
## --------------------------------------------------------------------------------
## Total.Number.of.Family.members
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 21 0.98 4.635 2.489 1 2
## .25 .50 .75 .90 .95
## 3 4 6 8 9
##
## lowest : 1 2 3 4 5, highest: 17 18 19 20 26
## --------------------------------------------------------------------------------
## Members.with.age.less.than.5.year.old
## n missing distinct Info Mean Gmd
## 41544 0 6 0.658 0.4102 0.6146
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 28705 9317 2933 511 64 14
## Proportion 0.691 0.224 0.071 0.012 0.002 0.000
## --------------------------------------------------------------------------------
## Members.with.age.5...17.years.old
## n missing distinct Info Mean Gmd
## 41544 0 9 0.93 1.363 1.495
##
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 14802 10445 8111 4704 2152 896 318 96 20
## Proportion 0.356 0.251 0.195 0.113 0.052 0.022 0.008 0.002 0.000
## --------------------------------------------------------------------------------
## Total.number.of.family.members.employed
## n missing distinct Info Mean Gmd
## 41544 0 9 0.917 1.273 1.209
##
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 11494 15312 9303 3579 1280 415 116 33 12
## Proportion 0.277 0.369 0.224 0.086 0.031 0.010 0.003 0.001 0.000
## --------------------------------------------------------------------------------
## Type.of.Building.House
## n missing distinct
## 41544 0 6
##
## lowest : Commercial/industrial/agricultural building Duplex Institutional living quarter Multi-unit residential Other building unit (e.g. cave, boat)
## highest: Duplex Institutional living quarter Multi-unit residential Other building unit (e.g. cave, boat) Single house
##
## Commercial/industrial/agricultural building (51, 0.001), Duplex (1084, 0.026),
## Institutional living quarter (9, 0.000), Multi-unit residential (1329, 0.032),
## Other building unit (e.g. cave, boat) (2, 0.000), Single house (39069, 0.940)
## --------------------------------------------------------------------------------
## Type.of.Roof
## n missing distinct
## 41544 0 7
##
## lowest : Light material (cogon,nipa,anahaw) Mixed but predominantly light materials Mixed but predominantly salvaged materials Mixed but predominantly strong materials Not Applicable
## highest: Mixed but predominantly salvaged materials Mixed but predominantly strong materials Not Applicable Salvaged/makeshift materials Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## --------------------------------------------------------------------------------
## Type.of.Walls
## n missing distinct
## 41544 0 6
##
## lowest : Light NOt applicable Quite Strong Salvaged Strong
## highest: NOt applicable Quite Strong Salvaged Strong Very Light
##
## Value Light NOt applicable Quite Strong Salvaged
## Frequency 8267 12 3487 456
## Proportion 0.199 0.000 0.084 0.011
##
## Value Strong Very Light
## Frequency 27739 1583
## Proportion 0.668 0.038
## --------------------------------------------------------------------------------
## House.Floor.Area
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 313 0.999 55.6 46.87 12 16
## .25 .50 .75 .90 .95
## 25 40 70 100 150
##
## lowest : 5 6 7 8 9, highest: 820 840 868 900 998
## --------------------------------------------------------------------------------
## House.Age
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 111 0.999 20.13 15.23 2 5
## .25 .50 .75 .90 .95
## 10 17 26 39 47
##
## lowest : 0 1 2 3 4, highest: 120 132 135 150 200
## --------------------------------------------------------------------------------
## Number.of.bedrooms
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 10 0.911 1.788 1.162 0 1
## .25 .50 .75 .90 .95
## 1 2 2 3 4
##
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 3930 13431 15456 6111 1875 484 169 46 29 13
## Proportion 0.095 0.323 0.372 0.147 0.045 0.012 0.004 0.001 0.001 0.000
## --------------------------------------------------------------------------------
## Tenure.Status
## n missing distinct
## 41544 0 8
##
## lowest : Not Applicable Own house, rent lot Own house, rent-free lot with consent of owner Own house, rent-free lot without consent of owner Own or owner-like possession of house and lot
## highest: Own house, rent-free lot without consent of owner Own or owner-like possession of house and lot Rent house/room including lot Rent-free house and lot with consent of owner Rent-free house and lot without consent of owner
## --------------------------------------------------------------------------------
## Toilet.Facilities
## n missing distinct
## 41544 0 8
##
## lowest : Closed pit None Open pit Others Water-sealed, other depository, shared with other household
## highest: Others Water-sealed, other depository, shared with other household Water-sealed, other depository, used exclusively by household Water-sealed, sewer septic tank, shared with other household Water-sealed, sewer septic tank, used exclusively by household
## --------------------------------------------------------------------------------
## Electricity
## n missing distinct Info Sum Mean Gmd
## 41544 0 2 0.292 37008 0.8908 0.1945
##
## --------------------------------------------------------------------------------
## Main.Source.of.Water.Supply
## n missing distinct
## 41544 0 11
##
## lowest : Dug well Lake, river, rain and others Others Own use, faucet, community water system Own use, tubed/piped deep well
## highest: Protected spring, river, stream, etc Shared, faucet, community water system Shared, tubed/piped deep well Tubed/piped shallow well Unprotected spring, river, stream, etc
## --------------------------------------------------------------------------------
## Number.of.Television
## n missing distinct Info Mean Gmd
## 41544 0 7 0.705 0.8569 0.5956
##
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##
## Value 0 1 2 3 4 5 6
## Frequency 10717 27089 2955 597 133 42 11
## Proportion 0.258 0.652 0.071 0.014 0.003 0.001 0.000
## --------------------------------------------------------------------------------
## Number.of.CD.VCD.DVD
## n missing distinct Info Mean Gmd
## 41544 0 6 0.735 0.4352 0.5375
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 24621 15983 752 163 20 5
## Proportion 0.593 0.385 0.018 0.004 0.000 0.000
## --------------------------------------------------------------------------------
## Number.of.Component.Stereo.set
## n missing distinct Info Mean Gmd
## 41544 0 6 0.396 0.1621 0.2755
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 35058 6284 174 13 10 5
## Proportion 0.844 0.151 0.004 0.000 0.000 0.000
## --------------------------------------------------------------------------------
## Number.of.Refrigerator.Freezer
## n missing distinct Info Mean Gmd
## 41544 0 6 0.709 0.3942 0.5075
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 25990 14881 569 73 17 14
## Proportion 0.626 0.358 0.014 0.002 0.000 0.000
## --------------------------------------------------------------------------------
## Number.of.Washing.Machine
## n missing distinct Info Mean Gmd
## 41544 0 4 0.648 0.3198 0.4419
##
## Value 0 1 2 3
## Frequency 28484 12845 204 11
## Proportion 0.686 0.309 0.005 0.000
## --------------------------------------------------------------------------------
## Number.of.Airconditioner
## n missing distinct Info Mean Gmd
## 41544 0 6 0.267 0.1298 0.2392
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 37457 3178 622 199 66 22
## Proportion 0.902 0.076 0.015 0.005 0.002 0.001
## --------------------------------------------------------------------------------
## Number.of.Car..Jeep..Van
## n missing distinct Info Mean Gmd
## 41544 0 6 0.18 0.08122 0.1538
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 38876 2136 413 77 29 13
## Proportion 0.936 0.051 0.010 0.002 0.001 0.000
## --------------------------------------------------------------------------------
## Number.of.Landline.wireless.telephones
## n missing distinct Info Mean Gmd
## 41544 0 5 0.153 0.06061 0.1154
##
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##
## Value 0 1 2 3 4
## Frequency 39302 2070 96 48 28
## Proportion 0.946 0.050 0.002 0.001 0.001
## --------------------------------------------------------------------------------
## Number.of.Cellular.phone
## n missing distinct Info Mean Gmd .05 .10
## 41544 0 11 0.949 1.906 1.646 0 0
## .25 .50 .75 .90 .95
## 1 2 3 4 5
##
## lowest : 0 1 2 3 4, highest: 6 7 8 9 10
##
## Value 0 1 2 3 4 5 6 7 8 9 10
## Frequency 6939 12484 10377 5820 3281 1467 666 242 153 49 66
## Proportion 0.167 0.301 0.250 0.140 0.079 0.035 0.016 0.006 0.004 0.001 0.002
## --------------------------------------------------------------------------------
## Number.of.Personal.Computer
## n missing distinct Info Mean Gmd
## 41544 0 7 0.497 0.315 0.5339
##
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##
## Value 0 1 2 3 4 5 6
## Frequency 32988 5650 1836 667 271 112 20
## Proportion 0.794 0.136 0.044 0.016 0.007 0.003 0.000
## --------------------------------------------------------------------------------
## Number.of.Stove.with.Oven.Gas.Range
## n missing distinct Info Mean Gmd
## 41544 0 4 0.342 0.135 0.2357
##
## Value 0 1 2 3
## Frequency 36101 5287 145 11
## Proportion 0.869 0.127 0.003 0.000
## --------------------------------------------------------------------------------
## Number.of.Motorized.Banca
## n missing distinct Info Mean Gmd
## 41544 0 4 0.035 0.01312 0.02596
##
## Value 0 1 2 3
## Frequency 41055 444 34 11
## Proportion 0.988 0.011 0.001 0.000
## --------------------------------------------------------------------------------
## Number.of.Motorcycle.Tricycle
## n missing distinct Info Mean Gmd
## 41544 0 6 0.564 0.2899 0.4552
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 31282 8811 1199 186 54 12
## Proportion 0.753 0.212 0.029 0.004 0.001 0.000
## --------------------------------------------------------------------------------
# ----- Skewness coefficient of each numeric variable -----
nums <- datos %>%
select_if(is.numeric)
skewness(nums)
## Total.Household.Income
## 8.8963098
## Total.Food.Expenditure
## 2.2309606
## Agricultural.Household.indicator
## 1.2857076
## Bread.and.Cereals.Expenditure
## 7.0110325
## Total.Rice.Expenditure
## 8.9897100
## Meat.Expenditure
## 2.6044671
## Total.Fish.and..marine.products.Expenditure
## 2.8673929
## Fruit.Expenditure
## 21.6962949
## Vegetables.Expenditure
## 2.5142803
## Restaurant.and.hotels.Expenditure
## 5.7407231
## Alcoholic.Beverages.Expenditure
## 5.9003524
## Tobacco.Expenditure
## 4.1225588
## Clothing..Footwear.and.Other.Wear.Expenditure
## 8.3542849
## Housing.and.water.Expenditure
## 9.7024646
## Imputed.House.Rental.Value
## 13.5800101
## Medical.Care.Expenditure
## 15.0488827
## Transportation.Expenditure
## 8.5767093
## Communication.Expenditure
## 4.2437244
## Education.Expenditure
## 8.7911722
## Miscellaneous.Goods.and.Services.Expenditure
## 6.0453470
## Special.Occasions.Expenditure
## 9.5947743
## Crop.Farming.and.Gardening.expenses
## 23.3872787
## Total.Income.from.Entrepreneurial.Acitivites
## 19.7165572
## Household.Head.Age
## 0.2369655
## Total.Number.of.Family.members
## 0.8668230
## Members.with.age.less.than.5.year.old
## 1.7905377
## Members.with.age.5...17.years.old
## 1.0535841
## Total.number.of.family.members.employed
## 1.0720297
## House.Floor.Area
## 4.3806605
## House.Age
## 1.3687074
## Number.of.bedrooms
## 0.8778295
## Electricity
## -2.5062518
## Number.of.Television
## 1.0914439
## Number.of.CD.VCD.DVD
## 1.0769677
## Number.of.Component.Stereo.set
## 2.4742277
## Number.of.Refrigerator.Freezer
## 1.1702791
## Number.of.Washing.Machine
## 0.9427084
## Number.of.Airconditioner
## 4.5715571
## Number.of.Car..Jeep..Van
## 5.6337177
## Number.of.Landline.wireless.telephones
## 6.0636086
## Number.of.Cellular.phone
## 1.2011583
## Number.of.Personal.Computer
## 3.0470314
## Number.of.Stove.with.Oven.Gas.Range
## 2.4572737
## Number.of.Motorized.Banca
## 11.5458788
## Number.of.Motorcycle.Tricycle
## 2.2262507
cat <- datos %>%
select_if(is.factor)
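To make the skewness figures above easier to read, the following is a minimal sketch of the moment-based skewness coefficient in base R. It assumes that `skewness()` above comes from the `moments` package and uses this same definition; the helper `skew_coef` and the simulated vectors are hypothetical and are not part of the survey data.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# skew_coef is a hypothetical helper written for illustration only.
skew_coef <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

set.seed(1)
right_skewed <- rlnorm(5000, meanlog = 12, sdlog = 1)  # long right tail, like income or expenditure
symmetric    <- rnorm(5000, mean = 50, sd = 10)        # roughly symmetric, like Household.Head.Age

skew_coef(right_skewed)  # clearly positive
skew_coef(symmetric)     # close to 0
```

Large positive values, such as those reported for the income and expenditure variables, indicate a long right tail; values near 0 indicate an approximately symmetric distribution.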
Given the scarce documentation available for the dataset, it was impossible to work out the meaning of some variables (for example, Agricultural.Household.indicator). We therefore remove the variables whose interpretation is unknown.
# ----- Removal of variables from the dataset -----
datos <- datos %>%
select(-Agricultural.Household.indicator,
-Members.with.age.less.than.5.year.old,
-Members.with.age.5...17.years.old,
-Household.Head.Occupation)
Once those variables are discarded, every value considered erroneous or not recorded (missing values) is labelled NA. Such values usually appear as unknown, not applicable or 0. The last case needs care, however: some variables can legitimately take the value 0, given the kind of data involved (socio-economic counts and amounts).
In addition, certain variables are converted to categorical variables, with an explicit choice of the levels they can take.
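The recoding described above can be sketched in base R as follows. The vectors `status` and `cars` are small made-up examples, not survey data, and `droplevels()` is used here as a base R stand-in for `fct_drop()`.

```r
# A toy factor with a sentinel label that should be treated as missing
status <- factor(c("Married", "Single", "Unknown", "Widowed", "Married"))

status[status == "Unknown"] <- NA   # label the sentinel value as NA
status <- droplevels(status)        # remove the now-empty "Unknown" level

levels(status)
sum(is.na(status))

# A count variable where 0 is a valid answer ("owns no car"):
# it must stay 0 and must not be converted to NA.
cars <- c(0, 1, 0, 2)
sum(is.na(cars))
```

The design choice is to recode only labels that explicitly signal a missing answer, and to decide case by case whether a 0 is a real measurement.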
# ----- Value correction and categorization -----
levels(datos$Main.Source.of.Income)
## [1] "Enterpreneurial Activities" "Other sources of Income"
## [3] "Wage/Salaries"
summary(datos$Main.Source.of.Income)
## Enterpreneurial Activities Other sources of Income
## 10320 10836
## Wage/Salaries
## 20388
datos$Main.Source.of.Income = factor(datos$Main.Source.of.Income,ordered=TRUE,levels=(c('Other sources of Income'
, 'Enterpreneurial Activities'
, 'Wage/Salaries')))
levels(datos$Main.Source.of.Income)
## [1] "Other sources of Income" "Enterpreneurial Activities"
## [3] "Wage/Salaries"
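As an illustration of what `factor(..., ordered = TRUE, levels = ...)` does, here is a small sketch on made-up vectors (`src` and `walls` are hypothetical). It shows two behaviours that matter for the recoding in this section: the order of `levels` defines the ranking, and any value not listed in `levels` silently becomes NA.

```r
src <- c("Wage/Salaries", "Other sources of Income",
         "Enterpreneurial Activities", "Wage/Salaries")

f <- factor(src, ordered = TRUE,
            levels = c("Other sources of Income",
                       "Enterpreneurial Activities",
                       "Wage/Salaries"))

f[1] > f[2]     # comparisons follow the order given in `levels`
as.integer(f)   # underlying integer codes

# Values that are not listed in `levels` are converted to NA
walls <- factor(c("Light", "Other"), levels = c("Light", "Strong"))
is.na(walls)
```

Because the chosen order becomes part of the variable, it is worth checking that the ranking is meaningful for each variable before using it as an ordered predictor.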
#--------------------------------------------------
levels(datos$Household.Head.Marital.Status)
## [1] "Annulled" "Divorced/Separated" "Married"
## [4] "Single" "Unknown" "Widowed"
summary(datos$Household.Head.Marital.Status)
## Annulled Divorced/Separated Married Single
## 11 1425 31347 1942
## Unknown Widowed
## 1 6818
datos$Household.Head.Marital.Status[which(datos$Household.Head.Marital.Status=='Unknown')] <-NA # Label the "Unknown" value as NA
datos$Household.Head.Marital.Status<-fct_drop(datos$Household.Head.Marital.Status)
levels(datos$Household.Head.Marital.Status)
## [1] "Annulled" "Divorced/Separated" "Married"
## [4] "Single" "Widowed"
datos$Household.Head.Marital.Status =
factor(datos$Household.Head.Marital.Status,ordered=TRUE,levels=
(c('Single'
,'Widowed'
,'Annulled'
,'Divorced/Separated'
,'Married')))
levels(datos$Household.Head.Marital.Status)
## [1] "Single" "Widowed" "Annulled"
## [4] "Divorced/Separated" "Married"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Household.Head.Class.of.Worker)
## [1] "Employer in own family-operated farm or business"
## [2] "Self-employed wihout any employee"
## [3] "Worked for government/government corporation"
## [4] "Worked for private establishment"
## [5] "Worked for private household"
## [6] "Worked with pay in own family-operated farm or business"
## [7] "Worked without pay in own family-operated farm or business"
summary(datos$Household.Head.Class.of.Worker)
## Employer in own family-operated farm or business
## 2581
## Self-employed wihout any employee
## 13766
## Worked for government/government corporation
## 2820
## Worked for private establishment
## 13731
## Worked for private household
## 811
## Worked with pay in own family-operated farm or business
## 14
## Worked without pay in own family-operated farm or business
## 285
## NA's
## 7536
datos$Household.Head.Class.of.Worker =
factor(datos$Household.Head.Class.of.Worker,ordered=TRUE,levels=
(c('Worked without pay in own family-operated farm or business'
,'Employer in own family-operated farm or business'
,'Worked with pay in own family-operated farm or business'
,'Self-employed wihout any employee'
,'Worked for private household'
,'Worked for private establishment'
,'Worked for government/government corporation')))
levels(datos$Household.Head.Class.of.Worker)
## [1] "Worked without pay in own family-operated farm or business"
## [2] "Employer in own family-operated farm or business"
## [3] "Worked with pay in own family-operated farm or business"
## [4] "Self-employed wihout any employee"
## [5] "Worked for private household"
## [6] "Worked for private establishment"
## [7] "Worked for government/government corporation"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Type.of.Household)
## [1] "Extended Family"
## [2] "Single Family"
## [3] "Two or More Nonrelated Persons/Members"
datos$Type.of.Household =
factor(datos$Type.of.Household,ordered=TRUE,levels=
(c('Single Family'
,'Two or More Nonrelated Persons/Members'
,'Extended Family')))
levels(datos$Type.of.Household)
## [1] "Single Family"
## [2] "Two or More Nonrelated Persons/Members"
## [3] "Extended Family"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Type.of.Building.House)
## [1] "Commercial/industrial/agricultural building"
## [2] "Duplex"
## [3] "Institutional living quarter"
## [4] "Multi-unit residential"
## [5] "Other building unit (e.g. cave, boat)"
## [6] "Single house"
datos$Type.of.Building.House =
factor(datos$Type.of.Building.House,ordered=TRUE,levels=
(c('Other building unit (e.g. cave, boat)'
,'Institutional living quarter'
,'Commercial/industrial/agricultural building'
,'Single house'
,'Duplex'
,'Multi-unit residential')))
levels(datos$Type.of.Building.House)
## [1] "Other building unit (e.g. cave, boat)"
## [2] "Institutional living quarter"
## [3] "Commercial/industrial/agricultural building"
## [4] "Single house"
## [5] "Duplex"
## [6] "Multi-unit residential"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Type.of.Roof)
## [1] "Light material (cogon,nipa,anahaw)"
## [2] "Mixed but predominantly light materials"
## [3] "Mixed but predominantly salvaged materials"
## [4] "Mixed but predominantly strong materials"
## [5] "Not Applicable"
## [6] "Salvaged/makeshift materials"
## [7] "Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)"
summary(datos$Type.of.Roof)
## Light material (cogon,nipa,anahaw)
## 5074
## Mixed but predominantly light materials
## 846
## Mixed but predominantly salvaged materials
## 56
## Mixed but predominantly strong materials
## 2002
## Not Applicable
## 12
## Salvaged/makeshift materials
## 212
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## 33342
datos$Type.of.Roof[which(datos$Type.of.Roof=='Not Applicable')] <-NA # Label the "Not Applicable" value as NA
datos$Type.of.Roof<-fct_drop(datos$Type.of.Roof)
levels(datos$Type.of.Roof)
## [1] "Light material (cogon,nipa,anahaw)"
## [2] "Mixed but predominantly light materials"
## [3] "Mixed but predominantly salvaged materials"
## [4] "Mixed but predominantly strong materials"
## [5] "Salvaged/makeshift materials"
## [6] "Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)"
summary(datos$Type.of.Roof)
## Light material (cogon,nipa,anahaw)
## 5074
## Mixed but predominantly light materials
## 846
## Mixed but predominantly salvaged materials
## 56
## Mixed but predominantly strong materials
## 2002
## Salvaged/makeshift materials
## 212
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## 33342
## NA's
## 12
datos$Type.of.Roof =
factor(datos$Type.of.Roof,ordered=TRUE,levels=
(c('Salvaged/makeshift materials'
,'Light material (cogon,nipa,anahaw)'
,'Mixed but predominantly salvaged materials'
,'Mixed but predominantly light materials'
,'Mixed but predominantly strong materials'
,'Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)')))
levels(datos$Type.of.Roof)
## [1] "Salvaged/makeshift materials"
## [2] "Light material (cogon,nipa,anahaw)"
## [3] "Mixed but predominantly salvaged materials"
## [4] "Mixed but predominantly light materials"
## [5] "Mixed but predominantly strong materials"
## [6] "Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Type.of.Walls)
## [1] "Light" "NOt applicable" "Quite Strong" "Salvaged"
## [5] "Strong" "Very Light"
summary(datos$Type.of.Walls)
## Light NOt applicable Quite Strong Salvaged Strong
## 8267 12 3487 456 27739
## Very Light
## 1583
datos$Type.of.Walls[which(datos$Type.of.Walls=='NOt applicable')] <-NA # Label the "NOt applicable" value as NA (the level is spelled this way in the data)
datos$Type.of.Walls<-fct_drop(datos$Type.of.Walls)
levels(datos$Type.of.Walls)
## [1] "Light" "Quite Strong" "Salvaged" "Strong" "Very Light"
summary(datos$Type.of.Walls)
## Light Quite Strong Salvaged Strong Very Light NA's
## 8267 3487 456 27739 1583 12
datos$Type.of.Walls=
factor(datos$Type.of.Walls,ordered=TRUE,levels=
(c('Salvaged'
,'Very Light'
,'Light'
,'Strong'
,'Quite Strong')))
levels(datos$Type.of.Walls)
## [1] "Salvaged" "Very Light" "Light" "Strong" "Quite Strong"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Toilet.Facilities)
## [1] "Closed pit"
## [2] "None"
## [3] "Open pit"
## [4] "Others"
## [5] "Water-sealed, other depository, shared with other household"
## [6] "Water-sealed, other depository, used exclusively by household"
## [7] "Water-sealed, sewer septic tank, shared with other household"
## [8] "Water-sealed, sewer septic tank, used exclusively by household"
summary(datos$Toilet.Facilities)
## Closed pit
## 2273
## None
## 1580
## Open pit
## 1189
## Others
## 353
## Water-sealed, other depository, shared with other household
## 950
## Water-sealed, other depository, used exclusively by household
## 2343
## Water-sealed, sewer septic tank, shared with other household
## 3694
## Water-sealed, sewer septic tank, used exclusively by household
## 29162
datos$Toilet.Facilities=
factor(datos$Toilet.Facilities,ordered=TRUE,levels=
(c('None'
,'Others'
,'Open pit'
,'Closed pit'
,'Water-sealed, other depository, shared with other household'
,'Water-sealed, other depository, used exclusively by household'
,'Water-sealed, sewer septic tank, shared with other household'
,'Water-sealed, sewer septic tank, used exclusively by household')))
levels(datos$Toilet.Facilities)
## [1] "None"
## [2] "Others"
## [3] "Open pit"
## [4] "Closed pit"
## [5] "Water-sealed, other depository, shared with other household"
## [6] "Water-sealed, other depository, used exclusively by household"
## [7] "Water-sealed, sewer septic tank, shared with other household"
## [8] "Water-sealed, sewer septic tank, used exclusively by household"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Household.Head.Occupation)
## NULL
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Main.Source.of.Water.Supply)
## [1] "Dug well"
## [2] "Lake, river, rain and others"
## [3] "Others"
## [4] "Own use, faucet, community water system"
## [5] "Own use, tubed/piped deep well"
## [6] "Peddler"
## [7] "Protected spring, river, stream, etc"
## [8] "Shared, faucet, community water system"
## [9] "Shared, tubed/piped deep well"
## [10] "Tubed/piped shallow well"
## [11] "Unprotected spring, river, stream, etc"
datos$Main.Source.of.Water.Supply=
factor(datos$Main.Source.of.Water.Supply,ordered=TRUE,levels=
(c('Others'
,'Dug well'
,'Lake, river, rain and others'
,'Unprotected spring, river, stream, etc'
,'Protected spring, river, stream, etc'
,'Tubed/piped shallow well'
,'Shared, tubed/piped deep well'
,'Own use, tubed/piped deep well'
,'Peddler'
,'Shared, faucet, community water system'
,'Own use, faucet, community water system')))
levels(datos$Main.Source.of.Water.Supply)
## [1] "Others"
## [2] "Dug well"
## [3] "Lake, river, rain and others"
## [4] "Unprotected spring, river, stream, etc"
## [5] "Protected spring, river, stream, etc"
## [6] "Tubed/piped shallow well"
## [7] "Shared, tubed/piped deep well"
## [8] "Own use, tubed/piped deep well"
## [9] "Peddler"
## [10] "Shared, faucet, community water system"
## [11] "Own use, faucet, community water system"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Tenure.Status)
## [1] "Not Applicable"
## [2] "Own house, rent lot"
## [3] "Own house, rent-free lot with consent of owner"
## [4] "Own house, rent-free lot without consent of owner"
## [5] "Own or owner-like possession of house and lot"
## [6] "Rent house/room including lot"
## [7] "Rent-free house and lot with consent of owner"
## [8] "Rent-free house and lot without consent of owner"
datos$Tenure.Status[which(datos$Tenure.Status=='Not Applicable')] <-NA # Label the "Not Applicable" value as NA
datos$Tenure.Status<-fct_drop(datos$Tenure.Status)
levels(datos$Tenure.Status)
## [1] "Own house, rent lot"
## [2] "Own house, rent-free lot with consent of owner"
## [3] "Own house, rent-free lot without consent of owner"
## [4] "Own or owner-like possession of house and lot"
## [5] "Rent house/room including lot"
## [6] "Rent-free house and lot with consent of owner"
## [7] "Rent-free house and lot without consent of owner"
summary(datos$Tenure.Status)
## Own house, rent lot
## 425
## Own house, rent-free lot with consent of owner
## 6165
## Own house, rent-free lot without consent of owner
## 995
## Own or owner-like possession of house and lot
## 29541
## Rent house/room including lot
## 2203
## Rent-free house and lot with consent of owner
## 2014
## Rent-free house and lot without consent of owner
## 128
## NA's
## 73
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Electricity)
## NULL
summary(datos$Electricity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.8908 1.0000 1.0000
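# Electricity is still numeric (0/1) here, so it is wrapped in factor() to obtain one bar per value
ggplot(datos, aes(x=Number.of.Airconditioner,fill= factor(Electricity))) + geom_bar(position = "dodge")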
The “Electricity” variable is a clear example of why values equal to 0 should not automatically be treated as NA. Since the dataset offers no description of its variables beyond their names, we need to work out what those zeros refer to. The plot shows that every household with air conditioning has a 1 in Electricity and that no household with a 0 has air conditioning, so it is reasonable to conclude that 1 means the household has electricity and 0 means it does not.
For clarity, the variable is recoded as categorical with the values “Si” (yes) and “No”, which replace the ones and zeros respectively.
# Replace 0/1 with No/Si
datos$Electricity[which(datos$Electricity=='0')] <- 'No'
datos$Electricity[which(datos$Electricity=='1')] <- 'Si'
# Convert Electricity to a categorical variable, since it is binary (0 or 1 / No or Si)
datos$Electricity<-as.factor(datos$Electricity)
datos$Electricity =
factor(datos$Electricity,ordered=TRUE,levels=
(c('No','Si')))
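The reasoning used to interpret the 0/1 coding of Electricity can also be checked numerically with a cross-tabulation. The sketch below uses a small made-up data frame (`toy`) with the same column names and the original numeric coding; it is an illustration, not the survey data.

```r
toy <- data.frame(
  Electricity              = c(1, 1, 0, 1, 0, 1),
  Number.of.Airconditioner = c(0, 2, 0, 1, 0, 0)
)

# Rows: Electricity value; columns: whether the household owns any air conditioner
tab <- table(Electricity = toy$Electricity,
             HasAircon   = toy$Number.of.Airconditioner > 0)
tab

# If 0 means "no electricity", this cell is expected to be empty
tab["0", "TRUE"]
```

An empty cell for households with Electricity = 0 and at least one air conditioner supports the interpretation, although it is consistent evidence rather than proof.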
The last group of variables corresponds to the number of goods owned. These variables are stored as numeric, but their ranges are very narrow compared with the other numeric variables. We inspect them more closely by looking at the mean number of each good per Filipino household.
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.bedrooms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.788 2.000 9.000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Refrigerator.Freezer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3942 1.0000 5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Washing.Machine)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3198 1.0000 3.0000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Airconditioner)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1298 0.0000 5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Car..Jeep..Van)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08121 0.00000 5.00000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.CD.VCD.DVD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4352 1.0000 5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Cellular.phone)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.906 3.000 10.000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Component.Stereo.set)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1621 0.0000 5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Landline.wireless.telephones)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06061 0.00000 4.00000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Personal.Computer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.315 0.000 6.000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Motorcycle.Tricycle)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2899 0.0000 5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Stove.with.Oven.Gas.Range)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.135 0.000 3.000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Television)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.8569 1.0000 6.0000
#---------------------------------------------------------------------------------------------------------------------------------------------
summary(datos$Number.of.Motorized.Banca)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01312 0.00000 3.00000
#---------------------------------------------------------------------------------------------------------------------------------------------
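The per-variable summaries above could also be collected in one step. The following sketch assumes that all goods variables share the `Number.of.` prefix, as they do in the output above (note that `Number.of.bedrooms` matches the pattern as well); `toy` is a made-up data frame used only for illustration.

```r
toy <- data.frame(
  Number.of.Television     = c(1, 0, 2, 1),
  Number.of.Car..Jeep..Van = c(0, 0, 1, 0),
  Total.Household.Income   = c(100, 200, 300, 400)
)

# Select the columns whose names start with "Number.of."
goods <- grep("^Number\\.of\\.", names(toy), value = TRUE)

goods_mean <- sapply(toy[goods], mean)  # mean number of each good per household
goods_max  <- sapply(toy[goods], max)   # largest value observed for each good

goods_mean
goods_max
```

This gives the mean and the maximum of each goods variable side by side, which makes the narrow ranges easy to compare.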
Once the data are tidied, and because the size of the initial sample could make it hard to work with, we draw a sample of 10000 observations by simple random sampling, fixing a random seed.
The sample of 10000 observations is split into two sets: a train set and a test/validation set (70%-30%). We work with the train set, while the test set is reserved for the final part (model evaluation).
# ----- Draw a sample from the initial dataset by simple random sampling without replacement -----
set.seed(300)
datos_s <- datos %>%
sample_n(size=10000,replace=FALSE)
# Split the sample of 10000 observations into two sets: train and test (70%-30%)
training <- createDataPartition(pull(datos_s, Total.Household.Income ),
p = 0.7, list = FALSE, times = 1)
datos_training <- slice(datos_s, training)
datos_testing <- slice(datos_s, -training)
var_train_cat <- datos_training%>%select_if(is.factor)
var_train_num <- datos_training%>%select_if(is.numeric)
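The logic of the 70%-30% split can be sketched in base R as follows. This is a simplified version under stated assumptions: it uses plain random sampling of row indices instead of `createDataPartition()`, which additionally balances the split on the outcome, and `toy` is simulated data with hypothetical columns `id` and `income`.

```r
set.seed(300)
n <- 10000
toy <- data.frame(id = seq_len(n),
                  income = rlnorm(n, meanlog = 12, sdlog = 1))

# Draw 70% of the row indices for training; the remaining rows form the test set
train_idx <- sample(seq_len(n), size = round(0.7 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))  # no row appears in both sets
```

Keeping the two sets disjoint is what allows the test set to give an honest estimate of the model's performance at the end.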
To study the qualitative variables in more depth, it is useful to look at their absolute frequencies, one by one, with the help of the table() function.
# ----- Absolute and relative frequencies ------
# Absolute frequencies - table() function (contingency table)
table(var_train_cat$Region)
##
## ARMM CAR Caraga
## 375 299 299
## I - Ilocos Region II - Cagayan Valley III - Central Luzon
## 401 367 532
## IVA - CALABARZON IVB - MIMAROPA IX - Zasmboanga Peninsula
## 658 193 319
## NCR V - Bicol Region VI - Western Visayas
## 724 412 497
## VII - Central Visayas VIII - Eastern Visayas X - Northern Mindanao
## 416 397 296
## XI - Davao Region XII - SOCCSKSARGEN
## 432 383
table(var_train_cat$Main.Source.of.Income)
##
## Other sources of Income Enterpreneurial Activities
## 1790 1771
## Wage/Salaries
## 3439
table(var_train_cat$Household.Head.Sex)
##
## Female Male
## 1513 5487
table(var_train_cat$Household.Head.Marital.Status)
##
## Single Widowed Annulled Divorced/Separated
## 319 1143 4 223
## Married
## 5311
table(var_train_cat$Household.Head.Job.or.Business.Indicator)
##
## No Job/Business With Job/Business
## 1241 5759
table(var_train_cat$Household.Head.Class.of.Worker)
##
## Worked without pay in own family-operated farm or business
## 50
## Employer in own family-operated farm or business
## 434
## Worked with pay in own family-operated farm or business
## 4
## Self-employed wihout any employee
## 2304
## Worked for private household
## 139
## Worked for private establishment
## 2339
## Worked for government/government corporation
## 489
table(var_train_cat$Type.of.Household)
##
## Single Family Two or More Nonrelated Persons/Members
## 4790 24
## Extended Family
## 2186
table(var_train_cat$Type.of.Building.House)
##
## Other building unit (e.g. cave, boat)
## 0
## Institutional living quarter
## 2
## Commercial/industrial/agricultural building
## 7
## Single house
## 6584
## Duplex
## 173
## Multi-unit residential
## 234
table(var_train_cat$Type.of.Roof)
##
## Salvaged/makeshift materials
## 32
## Light material (cogon,nipa,anahaw)
## 845
## Mixed but predominantly salvaged materials
## 11
## Mixed but predominantly light materials
## 133
## Mixed but predominantly strong materials
## 345
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## 5633
table(var_train_cat$Type.of.Walls)
##
## Salvaged Very Light Light Strong Quite Strong
## 81 258 1366 4701 590
table(var_train_cat$Tenure.Status)
##
## Own house, rent lot
## 81
## Own house, rent-free lot with consent of owner
## 1005
## Own house, rent-free lot without consent of owner
## 151
## Own or owner-like possession of house and lot
## 5011
## Rent house/room including lot
## 389
## Rent-free house and lot with consent of owner
## 336
## Rent-free house and lot without consent of owner
## 20
table(var_train_cat$Toilet.Facilities)
##
## None
## 255
## Others
## 65
## Open pit
## 206
## Closed pit
## 361
## Water-sealed, other depository, shared with other household
## 133
## Water-sealed, other depository, used exclusively by household
## 428
## Water-sealed, sewer septic tank, shared with other household
## 646
## Water-sealed, sewer septic tank, used exclusively by household
## 4906
table(var_train_cat$Electricity)
##
## No Si
## 753 6247
table(var_train_cat$Main.Source.of.Water.Supply)
##
## Others Dug well
## 16 663
## Lake, river, rain and others Unprotected spring, river, stream, etc
## 81 105
## Protected spring, river, stream, etc Tubed/piped shallow well
## 459 251
## Shared, tubed/piped deep well Own use, tubed/piped deep well
## 1012 775
## Peddler Shared, faucet, community water system
## 147 769
## Own use, faucet, community water system
## 2722
table(var_train_cat$Number.of.Motorcycle.Tricycle)
## < table of extent 0 >
table(var_train_cat$Household.Head.Highest.Grade.Completed)
##
## Agriculture, Forestry, and Fishery Programs
## 40
## Architecture and Building Programs
## 6
## Arts Programs
## 4
## Basic Programs
## 6
## Business and Administration Programs
## 212
## Computing/Information Technology Programs
## 51
## Elementary Graduate
## 1261
## Engineering and Engineering trades Programs
## 81
## Engineering and Engineering Trades Programs
## 147
## Environmental Protection Programs
## 2
## First Year College
## 145
## First Year High School
## 209
## First Year Post Secondary
## 20
## Fourth Year College
## 17
## Grade 1
## 152
## Grade 2
## 263
## Grade 3
## 309
## Grade 4
## 366
## Grade 5
## 348
## Grade 6
## 49
## Health Programs
## 66
## High School Graduate
## 1665
## Humanities Programs
## 9
## Journalism and Information Programs
## 7
## Law Programs
## 5
## Life Sciences Programs
## 4
## Manufacturing and Processing Programs
## 3
## Mathematics and Statistics Programs
## 2
## No Grade Completed
## 195
## Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree
## 13
## Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level)
## 0
## Personal Services Programs
## 22
## Physical Sciences Programs
## 3
## Post Baccalaureate
## 39
## Preschool
## 3
## Second Year College
## 216
## Second Year High School
## 363
## Second Year Post Secondary
## 22
## Security Services Programs
## 52
## Social and Behavioral Science Programs
## 23
## Social Services Programs
## 0
## Teacher Training and Education Sciences Programs
## 159
## Third Year College
## 167
## Third Year High School
## 234
## Transport Services Programs
## 39
## Veterinary Programs
## 1
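The one-by-one calls to `table()` above could also be written as a loop over every factor column. The sketch below uses a made-up data frame (`toy`) with a few of the same column names; it is an illustration under that assumption, not the authors' code.

```r
toy <- data.frame(
  Household.Head.Sex = factor(c("Male", "Female", "Male", NA)),
  Electricity        = factor(c("Si", "Si", "No", "Si"), levels = c("No", "Si")),
  Household.Head.Age = c(40, 35, 60, 52)
)

# Keep only the categorical (factor) columns
cat_cols <- toy[sapply(toy, is.factor)]

# One frequency table per factor; useNA = "ifany" also counts missing values
abs_freq <- lapply(cat_cols, table, useNA = "ifany")

names(abs_freq)
abs_freq$Household.Head.Sex
abs_freq$Electricity
```

By default `table()` leaves NA values out, which is why the counts for a variable with missing values, such as Household.Head.Class.of.Worker above, add up to fewer than the 7000 rows of the train set.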
Next, the same process is repeated to obtain the relative frequencies, this time using the prop.table() function.
# Relative frequencies - prop.table() function
prop.table(table(var_train_cat$Region))
##
## ARMM CAR Caraga
## 0.05357143 0.04271429 0.04271429
## I - Ilocos Region II - Cagayan Valley III - Central Luzon
## 0.05728571 0.05242857 0.07600000
## IVA - CALABARZON IVB - MIMAROPA IX - Zasmboanga Peninsula
## 0.09400000 0.02757143 0.04557143
## NCR V - Bicol Region VI - Western Visayas
## 0.10342857 0.05885714 0.07100000
## VII - Central Visayas VIII - Eastern Visayas X - Northern Mindanao
## 0.05942857 0.05671429 0.04228571
## XI - Davao Region XII - SOCCSKSARGEN
## 0.06171429 0.05471429
prop.table(table(var_train_cat$Main.Source.of.Income))
##
## Other sources of Income Enterpreneurial Activities
## 0.2557143 0.2530000
## Wage/Salaries
## 0.4912857
prop.table(table(var_train_cat$Household.Head.Sex))
##
## Female Male
## 0.2161429 0.7838571
prop.table(table(var_train_cat$Household.Head.Marital.Status))
##
## Single Widowed Annulled Divorced/Separated
## 0.0455714286 0.1632857143 0.0005714286 0.0318571429
## Married
## 0.7587142857
prop.table(table(var_train_cat$Household.Head.Job.or.Business.Indicator))
##
## No Job/Business With Job/Business
## 0.1772857 0.8227143
prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
##
## Worked without pay in own family-operated farm or business
## 0.008682063
## Employer in own family-operated farm or business
## 0.075360306
## Worked with pay in own family-operated farm or business
## 0.000694565
## Self-employed wihout any employee
## 0.400069457
## Worked for private household
## 0.024136135
## Worked for private establishment
## 0.406146901
## Worked for government/government corporation
## 0.084910575
prop.table(table(var_train_cat$Type.of.Household))
##
## Single Family Two or More Nonrelated Persons/Members
## 0.684285714 0.003428571
## Extended Family
## 0.312285714
prop.table(table(var_train_cat$Type.of.Building.House))
##
## Other building unit (e.g. cave, boat)
## 0.0000000000
## Institutional living quarter
## 0.0002857143
## Commercial/industrial/agricultural building
## 0.0010000000
## Single house
## 0.9405714286
## Duplex
## 0.0247142857
## Multi-unit residential
## 0.0334285714
prop.table(table(var_train_cat$Type.of.Roof))
##
## Salvaged/makeshift materials
## 0.004572082
## Light material (cogon,nipa,anahaw)
## 0.120731533
## Mixed but predominantly salvaged materials
## 0.001571653
## Mixed but predominantly light materials
## 0.019002715
## Mixed but predominantly strong materials
## 0.049292756
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## 0.804829261
prop.table(table(var_train_cat$Type.of.Walls))
##
##     Salvaged   Very Light        Light       Strong Quite Strong
##   0.01157804   0.03687822   0.19525443   0.67195540   0.08433391
prop.table(table(var_train_cat$Tenure.Status))
##
## Own house, rent lot
## 0.011583012
## Own house, rent-free lot with consent of owner
## 0.143715144
## Own house, rent-free lot without consent of owner
## 0.021593022
## Own or owner-like possession of house and lot
## 0.716573717
## Rent house/room including lot
## 0.055627056
## Rent-free house and lot with consent of owner
## 0.048048048
## Rent-free house and lot without consent of owner
## 0.002860003
prop.table(table(var_train_cat$Toilet.Facilities))
##
## None
## 0.036428571
## Others
## 0.009285714
## Open pit
## 0.029428571
## Closed pit
## 0.051571429
## Water-sealed, other depository, shared with other household
## 0.019000000
## Water-sealed, other depository, used exclusively by household
## 0.061142857
## Water-sealed, sewer septic tank, shared with other household
## 0.092285714
## Water-sealed, sewer septic tank, used exclusively by household
## 0.700857143
prop.table(table(var_train_cat$Electricity))
##
## No Si
## 0.1075714 0.8924286
prop.table(table(var_train_cat$Main.Source.of.Water.Supply))
##
##                                  Others                                Dug well
##                             0.002285714                             0.094714286
##            Lake, river, rain and others  Unprotected spring, river, stream, etc
##                             0.011571429                             0.015000000
##    Protected spring, river, stream, etc                Tubed/piped shallow well
##                             0.065571429                             0.035857143
##           Shared, tubed/piped deep well          Own use, tubed/piped deep well
##                             0.144571429                             0.110714286
##                                 Peddler  Shared, faucet, community water system
##                             0.021000000                             0.109857143
## Own use, faucet, community water system
##                             0.388857143
prop.table(table(var_train_cat$Household.Head.Highest.Grade.Completed))
##
## Agriculture, Forestry, and Fishery Programs
## 0.0057142857
## Architecture and Building Programs
## 0.0008571429
## Arts Programs
## 0.0005714286
## Basic Programs
## 0.0008571429
## Business and Administration Programs
## 0.0302857143
## Computing/Information Technology Programs
## 0.0072857143
## Elementary Graduate
## 0.1801428571
## Engineering and Engineering trades Programs
## 0.0115714286
## Engineering and Engineering Trades Programs
## 0.0210000000
## Environmental Protection Programs
## 0.0002857143
## First Year College
## 0.0207142857
## First Year High School
## 0.0298571429
## First Year Post Secondary
## 0.0028571429
## Fourth Year College
## 0.0024285714
## Grade 1
## 0.0217142857
## Grade 2
## 0.0375714286
## Grade 3
## 0.0441428571
## Grade 4
## 0.0522857143
## Grade 5
## 0.0497142857
## Grade 6
## 0.0070000000
## Health Programs
## 0.0094285714
## High School Graduate
## 0.2378571429
## Humanities Programs
## 0.0012857143
## Journalism and Information Programs
## 0.0010000000
## Law Programs
## 0.0007142857
## Life Sciences Programs
## 0.0005714286
## Manufacturing and Processing Programs
## 0.0004285714
## Mathematics and Statistics Programs
## 0.0002857143
## No Grade Completed
## 0.0278571429
## Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree
## 0.0018571429
## Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level)
## 0.0000000000
## Personal Services Programs
## 0.0031428571
## Physical Sciences Programs
## 0.0004285714
## Post Baccalaureate
## 0.0055714286
## Preschool
## 0.0004285714
## Second Year College
## 0.0308571429
## Second Year High School
## 0.0518571429
## Second Year Post Secondary
## 0.0031428571
## Security Services Programs
## 0.0074285714
## Social and Behavioral Science Programs
## 0.0032857143
## Social Services Programs
## 0.0000000000
## Teacher Training and Education Sciences Programs
## 0.0227142857
## Third Year College
## 0.0238571429
## Third Year High School
## 0.0334285714
## Transport Services Programs
## 0.0055714286
## Veterinary Programs
## 0.0001428571
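The per-variable calls above all follow the same pattern, so they could be condensed into a single pass over the columns. The sketch below is only an illustration on a toy data frame (`toy_cat` and its values are hypothetical stand-ins for `var_train_cat`). Note that `table()` drops NA values by default, so each set of proportions is computed over the non-missing observations only.

```r
# Toy stand-in for var_train_cat (hypothetical values, not the survey data)
toy_cat <- data.frame(
  Electricity = factor(c("No", "Si", "Si", "Si")),
  Household.Head.Sex = factor(c("Female", "Male", "Male", "Female"))
)

# One relative-frequency table per categorical column
rel_freq <- lapply(toy_cat, function(col) prop.table(table(col)))

# Each element is a table of proportions that sums to 1
print(rel_freq$Electricity)
```

Passing `useNA = "ifany"` to `table()` would make any missing values appear as a category of their own.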
On examining the visualizations, the categorical variable Electricity stands out. It is therefore compared across regions, since this can hint at which regions may have higher poverty levels. The comparison uses the CrossTable() function, which reports, for each cell, the absolute frequency, the relative frequency within the row, the relative frequency within the column, and the relative frequency over the whole table.
CrossTable(var_train_cat$Region, var_train_cat$Electricity, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 7000
##
##
## | var_train_cat$Electricity
## var_train_cat$Region | No | Si | Row Total |
## --------------------------|-----------|-----------|-----------|
## ARMM | 162 | 213 | 375 |
## | 0.432 | 0.568 | 0.054 |
## | 0.215 | 0.034 | |
## | 0.023 | 0.030 | |
## --------------------------|-----------|-----------|-----------|
## CAR | 18 | 281 | 299 |
## | 0.060 | 0.940 | 0.043 |
## | 0.024 | 0.045 | |
## | 0.003 | 0.040 | |
## --------------------------|-----------|-----------|-----------|
## Caraga | 24 | 275 | 299 |
## | 0.080 | 0.920 | 0.043 |
## | 0.032 | 0.044 | |
## | 0.003 | 0.039 | |
## --------------------------|-----------|-----------|-----------|
## I - Ilocos Region | 20 | 381 | 401 |
## | 0.050 | 0.950 | 0.057 |
## | 0.027 | 0.061 | |
## | 0.003 | 0.054 | |
## --------------------------|-----------|-----------|-----------|
## II - Cagayan Valley | 23 | 344 | 367 |
## | 0.063 | 0.937 | 0.052 |
## | 0.031 | 0.055 | |
## | 0.003 | 0.049 | |
## --------------------------|-----------|-----------|-----------|
## III - Central Luzon | 13 | 519 | 532 |
## | 0.024 | 0.976 | 0.076 |
## | 0.017 | 0.083 | |
## | 0.002 | 0.074 | |
## --------------------------|-----------|-----------|-----------|
## IVA - CALABARZON | 26 | 632 | 658 |
## | 0.040 | 0.960 | 0.094 |
## | 0.035 | 0.101 | |
## | 0.004 | 0.090 | |
## --------------------------|-----------|-----------|-----------|
## IVB - MIMAROPA | 26 | 167 | 193 |
## | 0.135 | 0.865 | 0.028 |
## | 0.035 | 0.027 | |
## | 0.004 | 0.024 | |
## --------------------------|-----------|-----------|-----------|
## IX - Zasmboanga Peninsula | 53 | 266 | 319 |
## | 0.166 | 0.834 | 0.046 |
## | 0.070 | 0.043 | |
## | 0.008 | 0.038 | |
## --------------------------|-----------|-----------|-----------|
## NCR | 6 | 718 | 724 |
## | 0.008 | 0.992 | 0.103 |
## | 0.008 | 0.115 | |
## | 0.001 | 0.103 | |
## --------------------------|-----------|-----------|-----------|
## V - Bicol Region | 47 | 365 | 412 |
## | 0.114 | 0.886 | 0.059 |
## | 0.062 | 0.058 | |
## | 0.007 | 0.052 | |
## --------------------------|-----------|-----------|-----------|
## VI - Western Visayas | 67 | 430 | 497 |
## | 0.135 | 0.865 | 0.071 |
## | 0.089 | 0.069 | |
## | 0.010 | 0.061 | |
## --------------------------|-----------|-----------|-----------|
## VII - Central Visayas | 47 | 369 | 416 |
## | 0.113 | 0.887 | 0.059 |
## | 0.062 | 0.059 | |
## | 0.007 | 0.053 | |
## --------------------------|-----------|-----------|-----------|
## VIII - Eastern Visayas | 64 | 333 | 397 |
## | 0.161 | 0.839 | 0.057 |
## | 0.085 | 0.053 | |
## | 0.009 | 0.048 | |
## --------------------------|-----------|-----------|-----------|
## X - Northern Mindanao | 44 | 252 | 296 |
## | 0.149 | 0.851 | 0.042 |
## | 0.058 | 0.040 | |
## | 0.006 | 0.036 | |
## --------------------------|-----------|-----------|-----------|
## XI - Davao Region | 48 | 384 | 432 |
## | 0.111 | 0.889 | 0.062 |
## | 0.064 | 0.061 | |
## | 0.007 | 0.055 | |
## --------------------------|-----------|-----------|-----------|
## XII - SOCCSKSARGEN | 65 | 318 | 383 |
## | 0.170 | 0.830 | 0.055 |
## | 0.086 | 0.051 | |
## | 0.009 | 0.045 | |
## --------------------------|-----------|-----------|-----------|
## Column Total | 753 | 6247 | 7000 |
## | 0.108 | 0.892 | |
## --------------------------|-----------|-----------|-----------|
##
##
#CrossTable(var_train_cat$Household.Head.Class.of.Worker, var_train_cat$Number.of.Stove.with.Oven.Gas.Range, prop.chisq = FALSE)
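The four quantities in each CrossTable() cell can also be obtained with base R, which makes it easy to rank regions by the share of households without electricity (in the table above, ARMM stands at 0.432 against 0.008 for NCR). This is a minimal sketch on hypothetical toy data, not the survey itself.

```r
# Toy data (hypothetical): two regions, six households
toy <- data.frame(
  Region = factor(c("ARMM", "ARMM", "NCR", "NCR", "NCR", "NCR")),
  Electricity = factor(c("No", "Si", "Si", "Si", "Si", "No"))
)

tab <- table(toy$Region, toy$Electricity)   # N
row_share <- prop.table(tab, margin = 1)    # N / Row Total
col_share <- prop.table(tab, margin = 2)    # N / Col Total
tot_share <- prop.table(tab)                # N / Table Total

# Regions ordered by the share of households without electricity
no_elec <- sort(row_share[, "No"], decreasing = TRUE)
print(no_elec)
```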
A third variable of interest is added to the analysis: Household.Head.Sex, the sex of the person who makes the decisions in the household.
# ----- Multidimensional frequency analysis -----
# Analysis of electricity / region / sex
ftable(var_train_cat$Region, var_train_cat$Household.Head.Sex, var_train_cat$Electricity)
## No Si
##
## ARMM Female 9 18
## Male 153 195
## CAR Female 1 65
## Male 17 216
## Caraga Female 4 46
## Male 20 229
## I - Ilocos Region Female 7 97
## Male 13 284
## II - Cagayan Valley Female 1 50
## Male 22 294
## III - Central Luzon Female 4 124
## Male 9 395
## IVA - CALABARZON Female 9 156
## Male 17 476
## IVB - MIMAROPA Female 1 36
## Male 25 131
## IX - Zasmboanga Peninsula Female 7 51
## Male 46 215
## NCR Female 1 194
## Male 5 524
## V - Bicol Region Female 8 86
## Male 39 279
## VI - Western Visayas Female 14 106
## Male 53 324
## VII - Central Visayas Female 10 106
## Male 37 263
## VIII - Eastern Visayas Female 9 77
## Male 55 256
## X - Northern Mindanao Female 8 53
## Male 36 199
## XI - Davao Region Female 11 73
## Male 37 311
## XII - SOCCSKSARGEN Female 11 60
## Male 54 258
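Because the groups differ so much in size, the raw counts above are hard to compare directly. One option, sketched here on hypothetical toy data, is to convert the three-way table into proportions within each region and sex combination before flattening it.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  Region = factor(c("ARMM", "ARMM", "ARMM", "ARMM", "NCR", "NCR", "NCR", "NCR")),
  Sex = factor(c("Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male")),
  Electricity = factor(c("No", "Si", "No", "No", "Si", "Si", "Si", "No"))
)

counts <- table(toy$Region, toy$Sex, toy$Electricity)

# Proportions of No/Si within each (region, sex) pair
shares <- prop.table(counts, margin = c(1, 2))
print(ftable(shares))
```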
Finally, a series of bar charts is used to visualize the data:
# ----- EDA plots for individual qualitative variables -----
ggplot(datos, aes(Region)) + geom_bar() + ggtitle("No. of families by region") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(datos, aes(Main.Source.of.Income)) + geom_bar() + ggtitle("No. of families by source of income")
# ----- Visualization of qualitative data -----
barplot(table(datos$Region), col = c("lightblue","yellow", "cadetblue4"),
main = "Bar chart of the absolute frequencies\n of the variable \"Region\"")
# Bar groups are the table columns (Electricity); bars within a group are the rows (sex)
barplot(table(datos$Household.Head.Sex, datos$Electricity),
beside = TRUE,
col = c("yellow", "lightblue"),
names.arg = c("No", "Yes"),
legend.text = c("Women", "Men"))
barplot(prop.table(table(datos$Household.Head.Class.of.Worker,datos$Main.Source.of.Income)),
beside = TRUE, col = c("chocolate","cornsilk1","cornflowerblue","blueviolet", "darkgoldenrod1", "coral", "brown", "chartreuse4"),
legend.text = TRUE, main = "Relative frequencies of source of\n income by class of worker",
ylim = c(0,1))
Next, the distribution and density of each quantitative variable are examined without any transformation, that is, on the raw variables. The aim is to identify the most skewed variables and to inspect the distributions and ranges they present.
# Histograms of the untransformed quantitative variables
summary(var_train_num$Total.Household.Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17840 106132 165773 248096 294968 4942530
qplot(var_train_num$Total.Household.Income,
geom="histogram",
binwidth = 10000,
main = "Histogram for Total Household Income",
xlab = "Total Household Income",
fill=I("blue"),
col=I("red"),
xlim=c(10000,12000000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Total.Food.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6275 51422 73354 85554 106458 720007
qplot(var_train_num$Total.Food.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Total Food Expenditure",
xlab = "Total Food Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(2000,800000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Bread.and.Cereals.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 16665 23196 24978 31200 345643
qplot(var_train_num$Bread.and.Cereals.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Bread.and.Cereals.Expenditure",
xlab = "Bread.and.Cereals.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,350000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Total.Rice.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 10909 16473 18014 23903 343907
qplot(var_train_num$Total.Rice.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Total Rice Expenditure",
xlab = "Total Rice Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,350000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Meat.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3336 7460 10626 14253 261566
qplot(var_train_num$Meat.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Meat.Expenditure",
xlab = "Meat.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,270000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Total.Fish.and..marine.products.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 5492 8649 10489 13212 81675
qplot(var_train_num$Total.Fish.and..marine.products.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Total.Fish.and..marine.products.Expenditure",
xlab = "Total.Fish.and..marine.products.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,190000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Fruit.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1012 1830 2544 3114 82600
qplot(var_train_num$Fruit.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Fruit.Expenditure",
xlab = "Fruit.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,70000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Vegetables.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2876 4393 5066 6400 49810
qplot(var_train_num$Vegetables.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Vegetables.Expenditure",
xlab = "Vegetables.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,80000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Restaurant.and.hotels.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2020 7480 16031 20892 421950
qplot(var_train_num$Restaurant.and.hotels.Expenditure,
geom="histogram",
binwidth = 5000,
main = "Histogram for Restaurant.and.hotels.Expenditure",
xlab = "Restaurant.and.hotels.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-5000,520000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Alcoholic.Beverages.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 276 1095 1300 38220
qplot(var_train_num$Alcoholic.Beverages.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Alcoholic.Beverages.Expenditure",
xlab = "Alcoholic.Beverages.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,36000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Tobacco.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 195 2275 3120 97740
qplot(var_train_num$Tobacco.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Tobacco.Expenditure",
xlab = "Tobacco.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,100000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Clothing..Footwear.and.Other.Wear.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1372 2776 4975 5771 112830
qplot(var_train_num$Clothing..Footwear.and.Other.Wear.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure",
xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,360000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Housing.and.water.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2310 13140 23163 38826 47157 1308180
qplot(var_train_num$Housing.and.water.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Housing.and.water.Expenditure",
xlab = "Housing.and.water.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(2000,842000))
## Warning: Removed 5 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Imputed.House.Rental.Value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 6000 10800 21088 24000 1200000
qplot(var_train_num$Imputed.House.Rental.Value,
geom="histogram",
binwidth = 1000,
main = "Histogram for Imputed.House.Rental.Value",
xlab = "Imputed.House.Rental.Value",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,730000))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Medical.Care.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 282.8 1099.0 7317.3 4597.2 672466.0
qplot(var_train_num$Medical.Care.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Medical.Care.Expenditure",
xlab = "Medical.Care.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,1000000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Transportation.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2434 6240 11926 13934 240000
qplot(var_train_num$Transportation.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Transportation.Expenditure",
xlab = "Transportation.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,500000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Communication.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 576 1560 4135 3998 87600
qplot(var_train_num$Communication.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Communication.Expenditure",
xlab = "Communication.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,100000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Education.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 884.5 7190.9 4120.0 396000.0
qplot(var_train_num$Education.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Education.Expenditure",
xlab = "Education.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,340000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Miscellaneous.Goods.and.Services.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18 3862 6867 12601 14151 292086
qplot(var_train_num$Miscellaneous.Goods.and.Services.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure",
xlab = "Miscellaneous.Goods.and.Services.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,320000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Special.Occasions.Expenditure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 1500 5353 5000 340000
qplot(var_train_num$Special.Occasions.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Special.Occasions.Expenditure",
xlab = "Special.Occasions.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,310000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Crop.Farming.and.Gardening.expenses)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 13928 6592 1779690
qplot(var_train_num$Crop.Farming.and.Gardening.expenses,
geom="histogram",
binwidth = 100000,
main = "Histogram for Crop.Farming.and.Gardening.expenses",
xlab = "Crop.Farming.and.Gardening.expenses",
fill=I("blue"),
col=I("red"),
xlim=c(-100000,3800000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Total.Income.from.Entrepreneurial.Acitivites)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 18990 54199 65950 4798140
qplot(var_train_num$Total.Income.from.Entrepreneurial.Acitivites,
geom="histogram",
binwidth = 100000,
main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites",
xlab = "Total.Income.from.Entrepreneurial.Acitivites",
fill=I("blue"),
col=I("red"),
xlim=c(-100000,4800000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Household.Head.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 41.00 50.00 51.26 61.00 98.00
qplot(var_train_num$Household.Head.Age,
geom="histogram",
binwidth = 5,
main = "Histogram for Household.Head.Age",
xlab = "Household.Head.Age",
fill=I("blue"),
col=I("red"),
xlim=c(10,100))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Total.Number.of.Family.members)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 4.649 6.000 20.000
qplot(var_train_num$Total.Number.of.Family.members,
geom="histogram",
binwidth = 1,
main = "Histogram for Total.Number.of.Family.members",
xlab = "Total.Number.of.Family.members",
fill=I("blue"),
col=I("red"),
xlim=c(-1,23))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Total.number.of.family.members.employed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.281 2.000 8.000
qplot(var_train_num$Total.number.of.family.members.employed,
geom="histogram",
binwidth = 1,
main = "Histogram for Total.number.of.family.members.employed",
xlab = "Total.number.of.family.members.employed",
fill=I("blue"),
col=I("red"),
xlim=c(-1,10))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$House.Floor.Area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 40.00 55.64 70.00 998.00
qplot(var_train_num$House.Floor.Area,
geom="histogram",
binwidth = 25,
main = "Histogram for House.Floor.Area",
xlab = "House.Floor.Area",
fill=I("blue"),
col=I("red"),
xlim=c(-25,1000))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$House.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 17.00 20.13 27.00 150.00
qplot(var_train_num$House.Age,
geom="histogram",
binwidth = 5,
main = "Histogram for House.Age",
xlab = "House.Age",
fill=I("blue"),
col=I("red"),
xlim=c(-1,130))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.bedrooms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.792 2.000 9.000
qplot(var_train_num$Number.of.bedrooms,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.bedrooms",
xlab = "Number.of.bedrooms",
fill=I("blue"),
col=I("red"),
xlim=c(-1,10))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Television)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.8621 1.0000 6.0000
qplot(var_train_num$Number.of.Television,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Television",
xlab = "Number.of.Television",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.CD.VCD.DVD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4489 1.0000 4.0000
qplot(var_train_num$Number.of.CD.VCD.DVD,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.CD.VCD.DVD",
xlab = "Number.of.CD.VCD.DVD",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Component.Stereo.set)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1636 0.0000 5.0000
qplot(var_train_num$Number.of.Component.Stereo.set,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Component.Stereo.set",
xlab = "Number.of.Component.Stereo.set",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Refrigerator.Freezer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4067 1.0000 5.0000
qplot(var_train_num$Number.of.Refrigerator.Freezer,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Refrigerator.Freezer",
xlab = "Number.of.Refrigerator.Freezer",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Washing.Machine)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3261 1.0000 3.0000
qplot(var_train_num$Number.of.Washing.Machine,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Washing.Machine",
xlab = "Number.of.Washing.Machine",
fill=I("blue"),
col=I("red"),
xlim=c(-1,5))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Airconditioner)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1291 0.0000 5.0000
qplot(var_train_num$Number.of.Airconditioner,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Airconditioner",
xlab = "Number.of.Airconditioner",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Car..Jeep..Van)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.08 0.00 5.00
qplot(var_train_num$Number.of.Car..Jeep..Van,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Car..Jeep..Van",
xlab = "Number.of.Car..Jeep..Van",
fill=I("blue"),
col=I("red"),
xlim=c(-1,6))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Landline.wireless.telephones)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06171 0.00000 4.00000
qplot(var_train_num$Number.of.Landline.wireless.telephones,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Landline.wireless.telephones",
xlab = "Number.of.Landline.wireless.telephones",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Cellular.phone)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 2.00 1.95 3.00 10.00
qplot(var_train_num$Number.of.Cellular.phone,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Cellular.phone",
xlab = "Number.of.Cellular.phone",
fill=I("blue"),
col=I("red"),
xlim=c(-1,12))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Personal.Computer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3287 0.0000 6.0000
qplot(var_train_num$Number.of.Personal.Computer,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Personal.Computer",
xlab = "Number.of.Personal.Computer",
fill=I("blue"),
col=I("red"),
xlim=c(-1,8))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Stove.with.Oven.Gas.Range)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1346 0.0000 3.0000
qplot(var_train_num$Number.of.Stove.with.Oven.Gas.Range,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Stove.with.Oven.Gas.Range",
xlab = "Number.of.Stove.with.Oven.Gas.Range",
fill=I("blue"),
col=I("red"),
xlim=c(-1,4))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Motorized.Banca)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01114 0.00000 3.00000
qplot(var_train_num$Number.of.Motorized.Banca,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Motorized.Banca",
xlab = "Number.of.Motorized.Banca",
fill=I("blue"),
col=I("red"),
xlim=c(-1,5))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(var_train_num$Number.of.Motorcycle.Tricycle)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3019 1.0000 5.0000
qplot(var_train_num$Number.of.Motorcycle.Tricycle,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Motorcycle.Tricycle",
xlab = "Number.of.Motorcycle.Tricycle",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
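The histogram calls above repeat the same template for every variable, so they could be generated from a small helper. The sketch below is one possible way to do it, not the authors' code: `plot_hist` and `toy_num` are hypothetical names, and it uses `ggplot()` directly because `qplot()` is deprecated in recent ggplot2 releases. It also omits `xlim`, which is the likely source of the repeated "Removed 2 rows" warnings, since bars at the edge of the limits are dropped.

```r
library(ggplot2)

# Toy stand-in for var_train_num (hypothetical values)
set.seed(1)
toy_num <- data.frame(
  Total.Household.Income = rlnorm(200, meanlog = 12, sdlog = 0.8),
  Household.Head.Age = round(runif(200, 15, 98))
)

# Hypothetical helper: histogram of one numeric column, titled after it
plot_hist <- function(df, var, bins = 30) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(bins = bins, fill = "blue", colour = "red") +
    labs(title = paste("Histogram for", var), x = var)
}

# One plot object per numeric column, stored in a named list
plots <- lapply(names(toy_num), function(v) plot_hist(toy_num, v))
names(plots) <- names(toy_num)
```

Calling `print(plots$Household.Head.Age)` draws a single histogram; using `bins` instead of `binwidth` avoids having to pick a width per variable.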
Most of the variables are right-skewed, a common feature of socioeconomic data. At a glance, very few have a symmetric distribution. House.Age is cited as an example with a roughly normal shape, although its summary (median 17, mean 20.13, maximum 150) still points to a right tail; Household.Head.Age (median 50, mean 51.26) is closer to symmetric.
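The visual impression of skewness can be backed by a number. The sketch below computes the moment-based sample skewness in base R on simulated columns (`income_like` and `age_like` are hypothetical, not survey variables): values well above 0 indicate a right tail, and values near 0 a roughly symmetric shape.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated columns (hypothetical)
set.seed(42)
toy_num <- data.frame(
  income_like = rlnorm(1000, meanlog = 12, sdlog = 0.8),  # right-skewed
  age_like = rnorm(1000, mean = 50, sd = 14)              # roughly symmetric
)

skew <- sapply(toy_num, skewness)
print(round(skew, 2))
```

Comparing mean and median, as in the summary() output above, gives a quicker but rougher indication of the same thing.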
As the final part of the exploratory data analysis, some further visualizations show what kind of population Filipino families made up in 2017. Judging from these plots, the population appears to be largely agrarian, with farm work being common. In addition, families in the NCR region are the ones with the highest expenses.
# Most common occupations in the Philippines
by_common_jobs <- datos_occupation %>%
group_by(Household.Head.Occupation) %>%
summarise(Total = n()) %>%
arrange(desc(Total)) %>%
head(20) %>% ungroup()
ggplot(data = by_common_jobs) + geom_bar(mapping = aes(x = Household.Head.Occupation, y = Total), stat = "identity") + labs(title="Most common jobs among Filipino families") + theme(axis.text.x = element_text(angle = 30, hjust = 1))
# Region and expenses
by_region_educ <- datos_occupation %>%
group_by(Region, Education.Expenditure, Housing.and.water.Expenditure) %>%
summarise(Total = n()) %>%
arrange(desc(Total)) %>% ungroup()
# The variable has to be shown on a log scale for the boxplot to be readable
ggplot(by_region_educ, aes(x=Region, y=Education.Expenditure)) + geom_boxplot(color="black", fill="orange", alpha = 0.6) + scale_y_log10() + labs(title="Education expenditure by region") + theme(axis.text.x = element_text(angle = 30, hjust = 1))
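To back a statement such as "families in one region spend the most" with figures, a per-region summary is one option. The sketch below groups by Region only and ranks regions by median education expenditure; the toy values are hypothetical. Two caveats about the plot above are worth keeping in mind: grouping by the expenditure columns as well yields one row per distinct combination rather than per household, and a log scale cannot display households whose expenditure is 0.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- data.frame(
  Region = c("NCR", "NCR", "NCR", "ARMM", "ARMM", "CAR", "CAR"),
  Education.Expenditure = c(9000, 12000, 300, 0, 400, 1500, 2500)
)

# One row per region, ordered by median education expenditure
by_region <- toy %>%
  group_by(Region) %>%
  summarise(
    households = n(),
    median_education = median(Education.Expenditure),
    .groups = "drop"
  ) %>%
  arrange(desc(median_education))

print(by_region)
```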
With the EDA complete and a better understanding of the available data, it is time to prepare the data for building the model. The first step is to diagnose the missing values, which will have to be imputed with plausible values.
From this point on, the work is done on the train set only, since the test data will not be used until the last part of this work.
# ----- Detection and imputation of missing data -----
# Total number of NAs in the train set
length(which(is.na(datos_training)))
## [1] 1253
# Total number of rows in the train set that contain at least one NA
length(which(!complete.cases(datos_training)))
## [1] 1250
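The two counts differ (1253 NA values in 1250 rows) because a few rows are missing more than one value. A per-column breakdown makes the picture clearer; the following sketch shows the three counts side by side on a hypothetical toy data frame.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  Tenure.Status = c("Own", NA, "Rent", "Own"),
  Type.of.Walls = c("Strong", NA, "Light", NA),
  Region = c("NCR", "CAR", "NCR", "ARMM")
)

total_na <- sum(is.na(toy))                # same as length(which(is.na(toy)))
rows_with_na <- sum(!complete.cases(toy))  # rows with at least one NA
na_per_column <- colSums(is.na(toy))       # NAs per variable

print(na_per_column)
```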
There are quite a few NA values in the set, but all of them belong to the qualitative variables. The plot below shows how the NAs are distributed across the qualitative variables.
# Number of NAs in the quantitative variables and in the qualitative variables
length(which(is.na(var_train_num)))
## [1] 0
length(which(is.na(var_train_cat)))
## [1] 1253
length(which(is.na(var_train_cat$Tenure.Status)))
## [1] 7
# Plot of how the NAs are distributed across the qualitative variables
aggr_plot <- aggr(var_train_cat,
  numbers = TRUE, sortVars = TRUE,
  labels = names(var_train_cat),
  cex.axis = .7, gap = 3,
  ylab = c("Histogram of missing data", "Missing-data patterns"),
  only.miss = TRUE)
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies
##
## Variables sorted by number of missings:
## Variable Count
## Household.Head.Class.of.Worker 0.1772857143
## Tenure.Status 0.0010000000
## Type.of.Walls 0.0005714286
## Type.of.Roof 0.0001428571
## Region 0.0000000000
## Main.Source.of.Income 0.0000000000
## Household.Head.Sex 0.0000000000
## Household.Head.Marital.Status 0.0000000000
## Household.Head.Highest.Grade.Completed 0.0000000000
## Household.Head.Job.or.Business.Indicator 0.0000000000
## Type.of.Household 0.0000000000
## Type.of.Building.House 0.0000000000
## Toilet.Facilities 0.0000000000
## Electricity 0.0000000000
## Main.Source.of.Water.Supply 0.0000000000
# Relative-frequency tables of the variables whose NA values will be imputed
table_pre_Tenure<-prop.table(table(var_train_cat$Tenure.Status))
table_pre_Worker<-prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
table_pre_Walls<-prop.table(table(var_train_cat$Type.of.Walls))
table_pre_Roof<-prop.table(table(var_train_cat$Type.of.Roof))
# Summary of the 4 variables whose NA values will be imputed
summary_Tenure <- summary(var_train_cat$Tenure.Status)
summary_Worker <- summary(var_train_cat$Household.Head.Class.of.Worker)
summary_Walls<-summary(var_train_cat$Type.of.Walls)
summary_Roof<-summary(var_train_cat$Type.of.Roof)
When choosing an imputation method, it helps to keep in mind that the variables involved are categorical and that the model to be built is a multiple linear regression.
A good option is therefore the non-linear kNN (k nearest neighbors) method. For each observation with a missing value, it computes the distance to every observation where the value is known, sorts those distances in ascending order, and keeps the k closest observations. The imputed category is the one that occurs most frequently among those k nearest neighbors.
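The idea can be sketched in a few lines of base R. This is only a toy illustration under stated assumptions: the function `knn_impute_cat`, the data frame `toy` and its columns are hypothetical, and the distance is a plain count of mismatching categories, whereas `VIM::kNN` is documented as using a variant of the Gower distance.

```r
# Toy kNN imputation for one categorical variable (base R only).
# Names and data are hypothetical; this is not the VIM implementation.
knn_impute_cat <- function(df, target, k = 3) {
  predictors <- setdiff(names(df), target)
  X <- as.matrix(df[predictors])            # character matrix of predictors
  missing_rows <- which(is.na(df[[target]]))
  donor_rows <- which(!is.na(df[[target]])) # rows with a known target value
  for (i in missing_rows) {
    # Distance = number of predictor columns whose category differs
    ref <- matrix(X[i, ], nrow = length(donor_rows), ncol = ncol(X), byrow = TRUE)
    d <- rowSums(X[donor_rows, , drop = FALSE] != ref)
    # Keep the k closest donors and take a majority vote
    nearest <- donor_rows[order(d)[seq_len(min(k, length(donor_rows)))]]
    votes <- table(df[[target]][nearest])
    df[[target]][i] <- names(votes)[which.max(votes)]
  }
  df
}

toy <- data.frame(
  wall   = c("strong", "strong", "light", "light", "strong", "light"),
  roof   = c("strong", "strong", "light", "light", "strong", "light"),
  tenure = c("own", "own", "rent", "rent", NA, NA),
  stringsAsFactors = FALSE
)
imputed <- knn_impute_cat(toy, "tenure", k = 2)
print(imputed$tenure)
```

Donors are fixed before the loop, so a value imputed for one row is never used as a neighbor for another row.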
# Imputation of the NA values with the non-linear kNN (k nearest neighbors) method
var_train_cat <- VIM::kNN(var_train_cat,variable='Tenure.Status',impNA=TRUE)
var_train_cat$Tenure.Status_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Household.Head.Class.of.Worker',impNA=TRUE)
var_train_cat$Household.Head.Class.of.Worker_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Type.of.Walls',impNA=TRUE)
var_train_cat$Type.of.Walls_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Type.of.Roof',impNA=TRUE)
var_train_cat$Type.of.Roof_imp<-NULL
# Check that no NA values remain in the categorical variables
length(which(is.na(var_train_cat)))
## [1] 0
# Relative-frequency tables after imputing the NA values with kNN
table_pos_Tenure<-prop.table(table(var_train_cat$Tenure.Status))
table_pos_Worker<-prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
table_pos_Walls<-prop.table(table(var_train_cat$Type.of.Walls))
table_pos_Roof<-prop.table(table(var_train_cat$Type.of.Roof))
Finally, the category proportions before and after imputation are compared to check that the imputation has not distorted them.
# Difference, in percentage points, between the proportions after and before imputation
porc_dif_Tenure <- (table_pos_Tenure*100)-(table_pre_Tenure*100)
porc_dif_Worker <- (table_pos_Worker*100)-(table_pre_Worker*100)
porc_dif_Walls <- (table_pos_Walls*100)-(table_pre_Walls*100)
porc_dif_Roof <- (table_pos_Roof*100)-(table_pre_Roof*100)
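One way to read such differences is to reduce each table to its largest absolute change. The sketch below uses a hypothetical toy vector, not the survey data, and hand-picked replacement values standing in for the imputed ones.

```r
# Hypothetical categorical vector with missing values (not survey data)
before <- c("own", "own", "rent", "own", NA, "rent", NA, "own")
after <- before
after[is.na(after)] <- c("own", "rent")  # stand-in for imputed values

# Fix the levels so both tables have the same categories in the same order
levels_all <- sort(unique(after))
prop_before <- prop.table(table(factor(before, levels = levels_all)))
prop_after  <- prop.table(table(factor(after,  levels = levels_all)))

# Change per category, in percentage points, and the largest absolute change
diff_pp <- 100 * (prop_after - prop_before)
max_abs_diff <- max(abs(diff_pp))
print(round(diff_pp, 2))
print(round(max_abs_diff, 2))
```

Subtracting two tables is only meaningful when both share the same categories in the same order, which is why the levels are fixed explicitly here.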
For a multiple linear regression model to be applicable, it is highly desirable that the following conditions hold:
Therefore, before fitting a multiple regression model to the numeric variables in this work, a transformation is needed to bring them as close as possible to a normal distribution. Recall that the EDA showed that most variables are right-skewed (non-normal distributions), which could degrade the model.
# Standardization of the numeric variables with scale() (mean 0, standard deviation 1)
var_train_num_NORM <- scale(var_train_num, center = TRUE, scale = TRUE)
# Histograms of the standardized numeric variables
summary(var_train_num_NORM[1:7000,1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.8689 -0.5357 -0.3107 0.0000 0.1769 17.7158
qplot(var_train_num_NORM[1:7000,1],
geom="histogram",
main = "Histogram for Total Household Income",
xlab = "Total Household Income",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5470 -0.6660 -0.2381 0.0000 0.4079 12.3802
qplot(var_train_num_NORM[1:7000,2],
geom="histogram",
main = "Histogram for Total Food Expenditure",
xlab = "Total Food Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,3])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.9653 -0.6541 -0.1402 0.0000 0.4895 25.2303
qplot(var_train_num_NORM[1:7000,3],
geom="histogram",
main = "Histogram for Bread.and.Cereals.Expenditure",
xlab = "Bread.and.Cereals.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,4])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6333 -0.6442 -0.1397 0.0000 0.5339 29.5476
qplot(var_train_num_NORM[1:7000,4],
geom="histogram",
main = "Histogram for Total Rice Expenditure",
xlab = "Total Rice Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,5])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9792 -0.6718 -0.2917 0.0000 0.3342 23.1227
qplot(var_train_num_NORM[1:7000,5],
geom="histogram",
main = "Histogram for Meat.Expenditure",
xlab = "Meat.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,6])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.3673 -0.6515 -0.2398 0.0000 0.3551 9.2801
qplot(var_train_num_NORM[1:7000,6],
geom="histogram",
main = "Histogram for Total.Fish.and..marine.products.Expenditure",
xlab = "Total.Fish.and..marine.products.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,7])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9286 -0.5592 -0.2607 0.0000 0.2080 29.2193
qplot(var_train_num_NORM[1:7000,7],
geom="histogram",
main = "Histogram for Fruit.Expenditure",
xlab = "Fruit.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,8])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5183 -0.6563 -0.2016 0.0000 0.3999 13.4106
qplot(var_train_num_NORM[1:7000,8],
geom="histogram",
main = "Histogram for Vegetables.Expenditure",
xlab = "Vegetables.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,9])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6627 -0.5792 -0.3535 0.0000 0.2010 16.7806
qplot(var_train_num_NORM[1:7000,9],
geom="histogram",
main = "Histogram for Restaurant.and.hotels.Expenditure",
xlab = "Restaurant.and.hotels.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,10])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.50105 -0.50105 -0.37471 0.00000 0.09402 16.99393
qplot(var_train_num_NORM[1:7000,10],
geom="histogram",
main = "Histogram for Alcoholic.Beverages.Expenditure",
xlab = "Alcoholic.Beverages.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,11])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5560 -0.5560 -0.5083 0.0000 0.2063 23.3245
qplot(var_train_num_NORM[1:7000,11],
geom="histogram",
main = "Histogram for Tobacco.Expenditure",
xlab = "Tobacco.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,12])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7271 -0.5265 -0.3213 0.0000 0.1164 15.7642
qplot(var_train_num_NORM[1:7000,12],
geom="histogram",
main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure",
xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,13])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6748 -0.4746 -0.2894 0.0000 0.1539 23.4559
qplot(var_train_num_NORM[1:7000,13],
geom="histogram",
main = "Histogram for Housing.and.water.Expenditure",
xlab = "Housing.and.water.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,14])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.53487 -0.38268 -0.26094 0.00000 0.07387 29.90183
qplot(var_train_num_NORM[1:7000,14],
geom="histogram",
main = "Histogram for Imputed.House.Rental.Value",
xlab = "Imputed.House.Rental.Value",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,15])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.26125 -0.25116 -0.22201 0.00000 -0.09712 23.74789
qplot(var_train_num_NORM[1:7000,15],
geom="histogram",
main = "Histogram for Medical.Care.Expenditure",
xlab = "Medical.Care.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,16])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6650 -0.5292 -0.3170 0.0000 0.1119 12.7173
qplot(var_train_num_NORM[1:7000,16],
geom="histogram",
main = "Histogram for Transportation.Expenditure",
xlab = "Transportation.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,17])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.57729 -0.49687 -0.35948 0.00000 -0.01915 11.65362
qplot(var_train_num_NORM[1:7000,17],
geom="histogram",
main = "Histogram for Communication.Expenditure",
xlab = "Communication.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,18])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3836 -0.3836 -0.3364 0.0000 -0.1638 20.7404
qplot(var_train_num_NORM[1:7000,18],
geom="histogram",
main = "Histogram for Education.Expenditure",
xlab = "Education.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,19])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.76389 -0.53049 -0.34809 0.00000 0.09411 16.96726
qplot(var_train_num_NORM[1:7000,19],
geom="histogram",
main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure",
xlab = "Miscellaneous.Goods.and.Services.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,20])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.38679 -0.38679 -0.27841 0.00000 -0.02553 24.17866
qplot(var_train_num_NORM[1:7000,20],
geom="histogram",
main = "Histogram for Special.Occasions.Expenditure",
xlab = "Special.Occasions.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,21])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2998 -0.2998 -0.2998 0.0000 -0.1579 38.0010
qplot(var_train_num_NORM[1:7000,21],
geom="histogram",
main = "Histogram for Crop.Farming.and.Gardening.expenses",
xlab = "Crop.Farming.and.Gardening.expenses",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,22])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.39734 -0.39734 -0.25812 0.00000 0.08615 34.77798
qplot(var_train_num_NORM[1:7000,22],
geom="histogram",
main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites",
xlab = "Total.Income.from.Entrepreneurial.Acitivites",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,23])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.55382 -0.72277 -0.08895 0.00000 0.68573 3.29145
qplot(var_train_num_NORM[1:7000,23],
geom="histogram",
main = "Histogram for Household.Head.Age",
xlab = "Household.Head.Age",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,24])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5950 -0.7207 -0.2836 0.0000 0.5907 6.7108
qplot(var_train_num_NORM[1:7000,24],
geom="histogram",
main = "Histogram for Total.Number.of.Family.members",
xlab = "Total.Number.of.Family.members",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,25])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0818 -1.0818 -0.2370 0.0000 0.6078 5.6765
qplot(var_train_num_NORM[1:7000,25],
geom="histogram",
main = "Histogram for Total.number.of.family.members.employed",
xlab = "Total.number.of.family.members.employed",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,26])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.8974 -0.5430 -0.2772 0.0000 0.2545 16.6989
qplot(var_train_num_NORM[1:7000,26],
geom="histogram",
main = "Histogram for House.Floor.Area",
xlab = "House.Floor.Area",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,27])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4055 -0.7074 -0.2187 0.0000 0.4794 9.0662
qplot(var_train_num_NORM[1:7000,27],
geom="histogram",
main = "Histogram for House.Age",
xlab = "House.Age",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,28])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6218 -0.7170 0.1878 0.0000 0.1878 6.5214
qplot(var_train_num_NORM[1:7000,28],
geom="histogram",
main = "Histogram for Number.of.bedrooms",
xlab = "Number.of.bedrooms",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,29])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.3539 0.2165 0.2165 0.0000 0.2165 8.0686
qplot(var_train_num_NORM[1:7000,29],
geom="histogram",
main = "Histogram for Number.of.Television",
xlab = "Number.of.Television",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,30])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.8054 -0.8054 -0.8054 0.0000 0.9890 6.3722
qplot(var_train_num_NORM[1:7000,30],
geom="histogram",
main = "Histogram for Number.of.CD.VCD.DVD",
xlab = "Number.of.CD.VCD.DVD",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,31])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4168 -0.4168 -0.4168 0.0000 -0.4168 12.3251
qplot(var_train_num_NORM[1:7000,31],
geom="histogram",
main = "Histogram for Number.of.Component.Stereo.set",
xlab = "Number.of.Component.Stereo.set",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,32])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.755 -0.755 -0.755 0.000 1.101 8.527
qplot(var_train_num_NORM[1:7000,32],
geom="histogram",
main = "Histogram for Number.of.Refrigerator.Freezer",
xlab = "Number.of.Refrigerator.Freezer",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,33])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6782 -0.6782 -0.6782 0.0000 1.4013 5.5605
qplot(var_train_num_NORM[1:7000,33],
geom="histogram",
main = "Histogram for Number.of.Washing.Machine",
xlab = "Number.of.Washing.Machine",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,34])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2882 -0.2882 -0.2882 0.0000 -0.2882 10.8704
qplot(var_train_num_NORM[1:7000,34],
geom="histogram",
main = "Histogram for Number.of.Airconditioner",
xlab = "Number.of.Airconditioner",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,35])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.235 -0.235 -0.235 0.000 -0.235 14.452
qplot(var_train_num_NORM[1:7000,35],
geom="histogram",
main = "Histogram for Number.of.Car..Jeep..Van",
xlab = "Number.of.Car..Jeep..Van",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,36])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2171 -0.2171 -0.2171 0.0000 -0.2171 13.8570
qplot(var_train_num_NORM[1:7000,36],
geom="histogram",
main = "Histogram for Number.of.Landline.wireless.telephones",
xlab = "Number.of.Landline.wireless.telephones",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,37])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.24618 -0.60702 0.03214 0.00000 0.67130 5.14543
qplot(var_train_num_NORM[1:7000,37],
geom="histogram",
main = "Histogram for Number.of.Cellular.phone",
xlab = "Number.of.Cellular.phone",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,38])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4435 -0.4435 -0.4435 0.0000 -0.4435 7.6520
qplot(var_train_num_NORM[1:7000,38],
geom="histogram",
main = "Histogram for Number.of.Personal.Computer",
xlab = "Number.of.Personal.Computer",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,39])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3827 -0.3827 -0.3827 0.0000 -0.3827 8.1497
qplot(var_train_num_NORM[1:7000,39],
geom="histogram",
main = "Histogram for Number.of.Stove.with.Oven.Gas.Range",
xlab = "Number.of.Stove.with.Oven.Gas.Range",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,40])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0927 -0.0927 -0.0927 0.0000 -0.0927 24.8646
qplot(var_train_num_NORM[1:7000,40],
geom="histogram",
main = "Histogram for Number.of.Motorized.Banca",
xlab = "Number.of.Motorized.Banca",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(var_train_num_NORM[1:7000,41])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5392 -0.5392 -0.5392 0.0000 1.2472 8.3928
qplot(var_train_num_NORM[1:7000,41],
geom="histogram",
main = "Histogram for Number.of.Motorcycle.Tricycle",
xlab = "Number.of.Motorcycle.Tricycle",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Standardization only shifts and rescales each variable, so it leaves the shape of its distribution unchanged: a few variables (for example Household.Head.Age) look normal or close to normal, but most remain strongly right-skewed. A different transformation is therefore applied next: the natural logarithm, which is what R's log() computes by default.
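The effect of each transformation on the shape of a distribution can be checked numerically with the sample skewness. The sketch below uses simulated data, not the survey variables; the `skewness` helper is defined here for illustration, and `log1p()` is shown only as a possible alternative for variables that contain zeros, not as the method used in this work.

```r
set.seed(1)
# Simulated right-skewed "expenditure" variable (hypothetical data)
x <- rlnorm(1000, meanlog = 9, sdlog = 1)

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw    <- skewness(x)
skew_scaled <- skewness(as.numeric(scale(x)))  # same shape, same skewness
skew_log    <- skewness(log(x))                # much closer to symmetric

print(c(raw = skew_raw, scaled = skew_scaled, log = skew_log))

# With zeros, log() returns -Inf while log1p(v) = log(1 + v) stays finite
with_zeros <- c(0, 1, 10)
print(log(with_zeros))
print(log1p(with_zeros))
```

Note that replacing -Inf with 0 after `log()` makes an original value of 0 indistinguishable from an original value of 1, since log(1) is also 0; `log1p()` keeps the two apart.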
# Log transform (natural logarithm); zeros become -Inf, since log(0) = -Inf
var_train_num_Log <- log(var_train_num)
# Replace the -Inf values with 0 so they do not disrupt plotting and later processing
Log_sin_inf <- replace(var_train_num_Log, var_train_num_Log == -Inf, 0)
# Histograms of the log-transformed numeric variables
summary(Log_sin_inf[1:7000,1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.789 11.572 12.018 12.097 12.595 15.413
qplot(Log_sin_inf[1:7000,1],
geom="histogram",
main = "Histogram for Total Household Income",
xlab = "Total Household Income",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.744 10.848 11.203 11.200 11.576 13.487
qplot(Log_sin_inf[1:7000,2],
geom="histogram",
main = "Histogram for Total Food Expenditure",
xlab = "Total Food Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,3])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 9.721 10.052 9.997 10.348 12.753
qplot(Log_sin_inf[1:7000,3],
geom="histogram",
main = "Histogram for Bread.and.Cereals.Expenditure",
xlab = "Bread.and.Cereals.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,4])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 9.297 9.709 9.447 10.082 12.748
qplot(Log_sin_inf[1:7000,4],
geom="histogram",
main = "Histogram for Total Rice Expenditure",
xlab = "Total Rice Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,5])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 8.112 8.917 8.718 9.565 12.474
qplot(Log_sin_inf[1:7000,5],
geom="histogram",
main = "Histogram for Meat.Expenditure",
xlab = "Meat.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,6])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 8.611 9.065 8.996 9.489 11.311
qplot(Log_sin_inf[1:7000,6],
geom="histogram",
main = "Histogram for Total.Fish.and..marine.products.Expenditure",
xlab = "Total.Fish.and..marine.products.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,7])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.920 7.512 7.455 8.044 11.322
qplot(Log_sin_inf[1:7000,7],
geom="histogram",
main = "Histogram for Fruit.Expenditure",
xlab = "Fruit.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,8])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.964 8.388 8.297 8.764 10.816
qplot(Log_sin_inf[1:7000,8],
geom="histogram",
main = "Histogram for Vegetables.Expenditure",
xlab = "Vegetables.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,9])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.611 8.920 8.124 9.947 12.953
qplot(Log_sin_inf[1:7000,9],
geom="histogram",
main = "Histogram for Restaurant.and.hotels.Expenditure",
xlab = "Restaurant.and.hotels.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,10])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 5.620 4.103 7.170 10.551
qplot(Log_sin_inf[1:7000,10],
geom="histogram",
main = "Histogram for Alcoholic.Beverages.Expenditure",
xlab = "Alcoholic.Beverages.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,11])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 5.273 4.044 8.046 11.490
qplot(Log_sin_inf[1:7000,11],
geom="histogram",
main = "Histogram for Tobacco.Expenditure",
xlab = "Tobacco.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,12])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.224 7.929 7.848 8.661 11.634
qplot(Log_sin_inf[1:7000,12],
geom="histogram",
main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure",
xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,13])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.745 9.483 10.050 10.139 10.761 14.084
qplot(Log_sin_inf[1:7000,13],
geom="histogram",
main = "Histogram for Housing.and.water.Expenditure",
xlab = "Housing.and.water.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,14])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 8.700 9.287 8.979 10.086 13.998
qplot(Log_sin_inf[1:7000,14],
geom="histogram",
main = "Histogram for Imputed.House.Rental.Value",
xlab = "Imputed.House.Rental.Value",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,15])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.645 7.002 6.898 8.433 13.419
qplot(Log_sin_inf[1:7000,15],
geom="histogram",
main = "Histogram for Medical.Care.Expenditure",
xlab = "Medical.Care.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,16])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.797 8.739 8.605 9.542 12.388
qplot(Log_sin_inf[1:7000,16],
geom="histogram",
main = "Histogram for Transportation.Expenditure",
xlab = "Transportation.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,17])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.356 7.352 6.842 8.293 11.381
qplot(Log_sin_inf[1:7000,17],
geom="histogram",
main = "Histogram for Communication.Expenditure",
xlab = "Communication.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,18])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 6.785 5.505 8.324 12.889
qplot(Log_sin_inf[1:7000,18],
geom="histogram",
main = "Histogram for Education.Expenditure",
xlab = "Education.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,19])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.890 8.259 8.834 8.917 9.558 12.585
qplot(Log_sin_inf[1:7000,19],
geom="histogram",
main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure",
xlab = "Miscellaneous.Goods.and.Services.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,20])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 7.313 5.590 8.517 12.737
qplot(Log_sin_inf[1:7000,20],
geom="histogram",
main = "Histogram for Special.Occasions.Expenditure",
xlab = "Special.Occasions.Expenditure",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,21])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.965 8.794 14.392
qplot(Log_sin_inf[1:7000,21],
geom="histogram",
main = "Histogram for Crop.Farming.and.Gardening.expenses",
xlab = "Crop.Farming.and.Gardening.expenses",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,22])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 9.852 6.878 11.097 15.384
qplot(Log_sin_inf[1:7000,22],
geom="histogram",
main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites",
xlab = "Total.Income.from.Entrepreneurial.Acitivites",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,23])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.708 3.714 3.912 3.896 4.111 4.585
qplot(Log_sin_inf[1:7000,23],
geom="histogram",
main = "Histogram for Household.Head.Age",
xlab = "Household.Head.Age",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,24])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.099 1.386 1.401 1.792 2.996
qplot(Log_sin_inf[1:7000,24],
geom="histogram",
main = "Histogram for Total.Number.of.Family.members",
xlab = "Total.Number.of.Family.members",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,25])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3185 0.6931 2.0794
qplot(Log_sin_inf[1:7000,25],
geom="histogram",
main = "Histogram for Total.number.of.family.members.employed",
xlab = "Total.number.of.family.members.employed",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,26])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.609 3.219 3.689 3.718 4.248 6.906
qplot(Log_sin_inf[1:7000,26],
geom="histogram",
main = "Histogram for House.Floor.Area",
xlab = "House.Floor.Area",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,27])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.303 2.833 2.696 3.296 5.011
qplot(Log_sin_inf[1:7000,27],
geom="histogram",
main = "Histogram for House.Age",
xlab = "House.Age",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,28])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.6931 0.5125 0.6931 2.1972
qplot(Log_sin_inf[1:7000,28],
geom="histogram",
main = "Histogram for Number.of.bedrooms",
xlab = "Number.of.bedrooms",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,29])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06998 0.00000 1.79176
qplot(Log_sin_inf[1:7000,29],
geom="histogram",
main = "Histogram for Number.of.Television",
xlab = "Number.of.Television",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,30])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01724 0.00000 1.38629
qplot(Log_sin_inf[1:7000,30],
geom="histogram",
main = "Histogram for Number.of.CD.VCD.DVD",
xlab = "Number.of.CD.VCD.DVD",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,31])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.004224 0.000000 1.609438
qplot(Log_sin_inf[1:7000,31],
geom="histogram",
main = "Histogram for Number.of.Component.Stereo.set",
xlab = "Number.of.Component.Stereo.set",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,32])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0113 0.0000 1.6094
qplot(Log_sin_inf[1:7000,32],
geom="histogram",
main = "Histogram for Number.of.Refrigerator.Freezer",
xlab = "Number.of.Refrigerator.Freezer",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,33])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.00354 0.00000 1.09861
qplot(Log_sin_inf[1:7000,33],
geom="histogram",
main = "Histogram for Number.of.Washing.Machine",
xlab = "Number.of.Washing.Machine",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,34])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01779 0.00000 1.60944
qplot(Log_sin_inf[1:7000,34],
geom="histogram",
main = "Histogram for Number.of.Airconditioner",
xlab = "Number.of.Airconditioner",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,35])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.009669 0.000000 1.609438
qplot(Log_sin_inf[1:7000,35],
geom="histogram",
main = "Histogram for Number.of.Car..Jeep..Van",
xlab = "Number.of.Car..Jeep..Van",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,36])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.004284 0.000000 1.386294
qplot(Log_sin_inf[1:7000,36],
geom="histogram",
main = "Histogram for Number.of.Landline.wireless.telephones",
xlab = "Number.of.Landline.wireless.telephones",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,37])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.6931 0.5659 1.0986 2.3026
qplot(Log_sin_inf[1:7000,37],
geom="histogram",
main = "Histogram for Number.of.Cellular.phone",
xlab = "Number.of.Cellular.phone",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,38])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06562 0.00000 1.79176
qplot(Log_sin_inf[1:7000,38],
geom="histogram",
main = "Histogram for Number.of.Personal.Computer",
xlab = "Number.of.Personal.Computer",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,39])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.002335 0.000000 1.098612
qplot(Log_sin_inf[1:7000,39],
geom="histogram",
main = "Histogram for Number.of.Stove.with.Oven.Gas.Range",
xlab = "Number.of.Stove.with.Oven.Gas.Range",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,40])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.000908 0.000000 1.098612
qplot(Log_sin_inf[1:7000,40],
geom="histogram",
main = "Histogram for Number.of.Motorized.Banca",
xlab = "Number.of.Motorized.Banca",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Log_sin_inf[1:7000,41])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02656 0.00000 1.60944
qplot(Log_sin_inf[1:7000,41],
geom="histogram",
main = "Histogram for Number.of.Motorcycle.Tricycle",
xlab = "Number.of.Motorcycle.Tricycle",
fill=I("blue"),
col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With this transformation, the results are better than those obtained with normalization. Many variables take on normal or near-normal distributions, which makes them usable in the model design. Others, however, would be problematic because their distributions remain highly skewed.
Some variables have a very high percentage of values equal to 0, which produces a spike at the left end of the distribution. This is because a large number of Filipino families live in extreme poverty (although their economic situation has been improving considerably since 2018).
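The size of that spike can be quantified per variable. The following is a minimal sketch on made-up counts (the data frame `toy_counts` and its values are hypothetical; only the column names come from the survey). It assumes the training data were transformed the same way the test set is later in this report, that is, with `log()` followed by replacing `-Inf` with 0. Under that assumption, original counts of 0 and 1 both end up at 0, because log(1) = 0, so part of the spike may come from the transformation itself.

```r
# Toy appliance counts (hypothetical data, not taken from the survey)
toy_counts <- data.frame(
  Number.of.Airconditioner = c(0, 0, 1, 0, 2, 0, 1, 0, 0, 3),
  Number.of.Cellular.phone = c(2, 1, 3, 0, 2, 4, 1, 2, 5, 3)
)

# Assumed transformation: log, then -Inf (from log(0)) replaced by 0
toy_log <- log(toy_counts)
toy_log[toy_log == -Inf] <- 0

# Share of observations that sit exactly at 0 after the transformation
zero_share <- colMeans(toy_log == 0)

# Share of original counts equal to 0 or 1 (both map to 0 after the transformation)
zero_or_one_share <- colMeans(toy_counts <= 1)

print(rbind(zero_share, zero_or_one_share))
```

A variable whose zero share is close to 1 carries very little variation and is a natural candidate for removal.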
In the next section, this analysis is used to discard the variables whose distributions do not meet the requirements, starting from the log-transformed data set.
Recall the optimal conditions for any multiple linear regression model:
Therefore, in light of the previous section, the analysis starts from the set of log-transformed variables, and variables with clearly asymmetric distributions are discarded. Then, based on the pairwise correlations among the remaining variables, those that are highly correlated with each other are rejected before designing the multiple linear regression model.
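In this report the asymmetric variables are identified by inspecting the histograms. As a complement, the screening could be made reproducible with a numeric criterion. The sketch below uses the sample skewness (third standardized moment); the helper `skewness()`, the toy data and the cutoff of 1 are all assumptions chosen for illustration, not part of the original analysis.

```r
# Sample skewness (third standardized moment); base R has no built-in for it
skewness <- function(x) {
  x <- x[is.finite(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy data (hypothetical): one symmetric column, one with a spike at 0
toy <- data.frame(
  symmetric = c(-2, -1, 0, 1, 2),
  spiked    = c(0, 0, 0, 0, 10)
)

skew <- sapply(toy, skewness)

# Hypothetical cutoff: flag columns whose absolute skewness exceeds 1
threshold <- 1
to_drop <- names(skew)[abs(skew) > threshold]

# Keep only the columns that were not flagged
toy_reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
print(skew)
print(to_drop)
```

The cutoff is a judgment call, so the flagged list should still be compared against the histograms before any variable is removed.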
# ----- Drop variables whose distributions are not normal/symmetric -----
Log_reduced <- Log_sin_inf%>%select(
-Restaurant.and.hotels.Expenditure,
-Alcoholic.Beverages.Expenditure,
-Tobacco.Expenditure,
-Imputed.House.Rental.Value,
-Medical.Care.Expenditure,
-Communication.Expenditure,
-Education.Expenditure,
-Total.number.of.family.members.employed,
-Special.Occasions.Expenditure,
-Crop.Farming.and.Gardening.expenses,
-Total.Income.from.Entrepreneurial.Acitivites,
-Number.of.bedrooms,
-Number.of.Television,
-Number.of.CD.VCD.DVD,
-Number.of.Component.Stereo.set,
-Number.of.Refrigerator.Freezer,
-Number.of.Washing.Machine,
-Number.of.Airconditioner,
-Number.of.Car..Jeep..Van,
-Number.of.Landline.wireless.telephones,
-Number.of.Cellular.phone,
-Number.of.Personal.Computer,
-Number.of.Stove.with.Oven.Gas.Range,
-Number.of.Motorized.Banca,
-Number.of.Motorcycle.Tricycle)
# Compute the cross-correlation matrix
cor_matrix_log_reduced <- round(cor(Log_reduced), 4)
# ----- Heat map of the cross-correlation matrix -----
# Note: X1/X2 are the index columns produced by reshape::melt(); reshape2::melt() names them Var1/Var2
mapa_corr <- melt(cor_matrix_log_reduced)
ggplot(data = mapa_corr, aes(x = X1, y = X2, fill = value)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, size = 6, hjust = 1),
        axis.text.y = element_text(vjust = 1, size = 5, hjust = 1))
It is advisable to avoid predictors that are highly correlated with each other, discarding from each such pair the variable that is more correlated with all the others. The variable to be predicted ("Total.Household.Income") is excluded from this step, because in its case a high correlation is precisely what is wanted.
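The rule just described can be sketched in a few lines of base R. This is a simplified illustration of the idea, not the implementation used by `caret::findCorrelation`; the function `prune_correlated()` and the toy correlation matrix are hypothetical.

```r
# Simplified pruning rule: while some pair exceeds the cutoff, drop the member
# of the most correlated pair that has the larger mean absolute correlation
prune_correlated <- function(cor_mat, cutoff = 0.5) {
  keep <- colnames(cor_mat)
  repeat {
    if (length(keep) < 2) break
    sub <- abs(cor_mat[keep, keep, drop = FALSE])
    diag(sub) <- 0
    if (max(sub) <= cutoff) break
    # Row/column position of the most correlated remaining pair
    pair <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    # Mean absolute correlation of each variable with all the others
    mean_abs <- rowSums(sub) / (length(keep) - 1)
    candidates <- keep[pair]
    drop_var <- candidates[which.max(mean_abs[candidates])]
    keep <- setdiff(keep, drop_var)
  }
  keep
}

# Toy correlation matrix (hypothetical values)
vars <- c("a", "b", "c")
cm <- matrix(c(1.0, 0.8, 0.6,
               0.8, 1.0, 0.1,
               0.6, 0.1, 1.0),
             nrow = 3, dimnames = list(vars, vars))

kept <- prune_correlated(cm, cutoff = 0.5)
print(kept)
```

In the toy matrix, `a` and `b` form the most correlated pair, and `a` is dropped because it is also strongly correlated with `c`.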
# ----- Variable selection ----- #
# Subset without the "income" variable to be predicted (first column)
sin_income_log <- Log_reduced[, -1]
# Flag highly correlated variables (caret::findCorrelation)
index_log <- findCorrelation(cor(sin_income_log), cutoff = 0.5, verbose = TRUE, exact = TRUE)
## Compare row 1 and column 11 with corr 0.757
## Means: 0.488 vs 0.309 so flagging column 1
## Compare row 11 and column 4 with corr 0.546
## Means: 0.399 vs 0.283 so flagging column 11
## Compare row 4 and column 5 with corr 0.572
## Means: 0.375 vs 0.267 so flagging column 4
## Compare row 2 and column 5 with corr 0.549
## Means: 0.342 vs 0.248 so flagging column 2
## Compare row 9 and column 10 with corr 0.551
## Means: 0.314 vs 0.23 so flagging column 9
## Compare row 5 and column 7 with corr 0.696
## Means: 0.293 vs 0.21 so flagging column 5
## All correlations <= 0.5
# Drop the flagged columns (all_of() marks index_log as an external vector of column positions)
sin_income_log <- sin_income_log %>% select(-all_of(index_log))
# Recompute the correlation matrix with the reduced set of variables
new_var_train_log <- cbind(Total.Household.Income = Log_reduced[, 1], sin_income_log)
cor_mat_log <- cor(new_var_train_log)
# Sort the columns by decreasing correlation with Total.Household.Income
cor_mat_log <- cor_mat_log[, order(cor_mat_log[1, ], decreasing = TRUE)]
ggcorrplot(t(cor_mat_log), method = "circle") # Plot of the correlation matrix
# Variables with a correlation above 0.5 with "Total.Household.Income" will be selected
Variables_ordenadas <- data.frame(t(cor_mat_log)[, 'Total.Household.Income']) # Keep only the sorted column
colnames(Variables_ordenadas) <- 'Coef. Corr'
View(Variables_ordenadas)
# Fit the multiple linear regression with the variables whose correlation with income exceeds 0.5
RLM<-lm(Total.Household.Income~Transportation.Expenditure
+Clothing..Footwear.and.Other.Wear.Expenditure
+Fruit.Expenditure
,data=new_var_train_log)
# Standardized residuals of the model
residuos <- rstandard(RLM)
# Fitted values of the model
valores.ajustados <- fitted(RLM)
# Check that the residuals show no pattern against the fitted values
plot(valores.ajustados, residuos)
# Estimated beta coefficients of the multiple linear regression
summary(RLM)
##
## Call:
## lm(formula = Total.Household.Income ~ Transportation.Expenditure +
## Clothing..Footwear.and.Other.Wear.Expenditure + Fruit.Expenditure,
## data = new_var_train_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4600 -0.3315 -0.0430 0.2857 4.1229
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.460310 0.051031 146.19
## Transportation.Expenditure 0.228790 0.004760 48.06
## Clothing..Footwear.and.Other.Wear.Expenditure 0.145010 0.005051 28.71
## Fruit.Expenditure 0.205250 0.007118 28.84
## Pr(>|t|)
## (Intercept) <2e-16 ***
## Transportation.Expenditure <2e-16 ***
## Clothing..Footwear.and.Other.Wear.Expenditure <2e-16 ***
## Fruit.Expenditure <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4992 on 6996 degrees of freedom
## Multiple R-squared: 0.5792, Adjusted R-squared: 0.579
## F-statistic: 3210 on 3 and 6996 DF, p-value: < 2.2e-16
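Because both the response and the predictors are on a log scale, each coefficient can be read as an elasticity: multiplying a predictor by a factor k multiplies the predicted income by k raised to that coefficient, holding the other predictors fixed. The sketch below applies this to the estimates printed in the summary above; the 10% increase and the shortened names are arbitrary choices for illustration, and the reading only holds for observations whose expenditures are strictly positive (zeros were mapped to 0 on the log scale).

```r
# Coefficients copied from the summary(RLM) output (names shortened for readability)
beta <- c(Transportation = 0.228790,
          Clothing       = 0.145010,
          Fruit          = 0.205250)

# Hypothetical scenario: each expenditure increases by 10%, one at a time
k <- 1.10

# Percentage change in predicted income implied by the log-log model
pct_change <- (k^beta - 1) * 100
print(round(pct_change, 2))
```

For small changes, the percentage effect is approximately the coefficient times the percentage change in the predictor.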
To check that the result is valid, the following plot shows whether the residuals follow a normal distribution. A Q-Q plot compares the theoretical quantiles of a normal distribution with the observed ones; the more points fall on the line, the better.
qqnorm(residuos)
qqline(residuos)
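The visual reading of a Q-Q plot can be complemented with a single number: the correlation between the theoretical and the observed quantiles, which approaches 1 as the points line up. This is a rough sketch on synthetic values (the vectors `res_normal` and `res_skewed` are hypothetical stand-ins for residuals), not a formal normality test.

```r
# Synthetic "residuals" (hypothetical): exact normal quantiles and exact
# exponential quantiles, used as a symmetric and a right-skewed example
n <- 500
res_normal <- qnorm(ppoints(n))
res_skewed <- qexp(ppoints(n))

# qqnorm() returns the plotted coordinates when plot.it = FALSE
qq_normal <- qqnorm(res_normal, plot.it = FALSE)
qq_skewed <- qqnorm(res_skewed, plot.it = FALSE)

# Correlation between theoretical (x) and sample (y) quantiles
qq_cor_normal <- cor(qq_normal$x, qq_normal$y)
qq_cor_skewed <- cor(qq_skewed$x, qq_skewed$y)

print(c(normal = qq_cor_normal, skewed = qq_cor_skewed))
```

Applied to the model, the same two lines with `residuos` in place of the synthetic vectors would give a summary that is easy to compare between the training and test fits.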
# Select the model's predictor variables from the test set
pre_datos_testing <- datos_testing %>% select(Transportation.Expenditure,
                                              Clothing..Footwear.and.Other.Wear.Expenditure,
                                              Fruit.Expenditure)
# The test set must be log-transformed before validation, because the training set was log-transformed
pre_datos_testing <- log(pre_datos_testing)
# Replace -Inf values (from log(0)) with 0
pre_datos_testing <- replace(pre_datos_testing, pre_datos_testing == -Inf, 0)
# Predict with the fitted RLM model
ic <- predict(RLM, pre_datos_testing)
# Collect the actual values in a vector so they can be compared with the predictions
Valores_reales <- log(datos_testing$Total.Household.Income)
Valores_predichos <- ic
# Compute the residuals on the test set
residuos <- Valores_reales - Valores_predichos
# Check that the residuals show no pattern against the predicted values
plot(Valores_predichos, residuos)
As with the training fit, a Q-Q plot is used to check whether the residuals on the test set follow a normal distribution; the more points fall on the line, the better.
qqnorm(residuos)
qqline(residuos)
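Besides the residual plots, the test-set fit can be summarized numerically. The sketch below computes the RMSE, the MAE and an out-of-sample R-squared on the log scale; the vectors `actual` and `predicted` are made-up values standing in for `Valores_reales` and `Valores_predichos`, and these metrics are a suggestion rather than part of the original evaluation.

```r
# Made-up log-income values (hypothetical stand-ins for Valores_reales / Valores_predichos)
actual    <- c(10.0, 11.0, 12.0, 13.0)
predicted <- c(10.5, 10.5, 12.5, 12.5)

errors <- actual - predicted

# Root mean squared error and mean absolute error, both in log units
rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))

# Out-of-sample R-squared: 1 - SS_res / SS_tot, using the mean of the actual values
r2 <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

print(c(RMSE = rmse, MAE = mae, R2 = r2))
```

Comparing the test-set R-squared with the value reported by `summary(RLM)` on the training data gives a quick indication of whether the model generalizes.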